From b2bf23b1725edd00d528e5d1469d3d1a0fa4dd1f Mon Sep 17 00:00:00 2001
From: nataliaElv
Date: Thu, 21 Nov 2024 16:08:19 +0100
Subject: [PATCH] Images & apply review comments

---
 chapters/en/chapter10/1.mdx |  6 +++---
 chapters/en/chapter10/4.mdx | 23 +++++++++++------------
 chapters/en/chapter10/5.mdx |  3 +--
 chapters/en/chapter10/7.mdx |  4 ++++
 4 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/chapters/en/chapter10/1.mdx b/chapters/en/chapter10/1.mdx
index f6a1cb788..bfb0e6836 100644
--- a/chapters/en/chapter10/1.mdx
+++ b/chapters/en/chapter10/1.mdx
@@ -1,6 +1,8 @@
 # Introduction to Argilla[[introduction-to-argilla]]
 
-In Chapter 5 you learnt how to build a dataset using the 🤗 Datasets library and in Chapter 6 you explored how to fine-tune models for some common NLP tasks. In this chapter, you will learn how to use Argilla to **curate datasets** that you can use to train and evaluate your models.
+In Chapter 5 you learnt how to build a dataset using the 🤗 Datasets library and in Chapter 6 you explored how to fine-tune models for some common NLP tasks. In this chapter, you will learn how to use Argilla to **annotate and curate datasets** that you can use to train and evaluate your models.
+
+The key to training models that perform well is to have high-quality data. Although there are some good datasets in the Hub that you could use to train and evaluate your models, these may not be relevant for your specific application or use case. In this scenario, you may want to build and curate a dataset of your own. Argilla will help you to do this efficiently.
 
 With Argilla you can:
 
@@ -9,8 +11,6 @@ With Argilla you can:
 - gather **human feedback** for LLMs and multi-modal models.
 - invite experts to collaborate with you in Argilla, or crowdsource annotations!
 
-The key to training models that perform well is to have high-quality data. Although there are some good datasets in the Hub that you could use to train and evaluate your models, these may not be relevant for your specific application or use case. In this scenario, you may want to build and curate a dataset of your own. Argilla will help you to do this efficiently.
-
 Here are some of the things that you will learn in this chapter:
 
 - How to set up your own Argilla instance.
diff --git a/chapters/en/chapter10/4.mdx b/chapters/en/chapter10/4.mdx
index f675baac0..ea53a2a20 100644
--- a/chapters/en/chapter10/4.mdx
+++ b/chapters/en/chapter10/4.mdx
@@ -1,9 +1,5 @@
 # Annotate your dataset
 
-🚧 WIP 🚧
-
-##TODO: Add screenshots!
-
 Now it is time to start working from the Argilla UI to annotate our dataset.
 
 ## Align your team with annotation guidelines
@@ -12,6 +8,10 @@ Before you start annotating your dataset, it is always good practice to write so
 
 In Argilla, you can go to your dataset settings page in the UI and modify the guidelines and the descriptions of your questions to help with alignment.
 
+Screenshot of the Dataset Settings page in Argilla.
+
+If you want to dive deeper into the topic of how to write good guidelines, we recommend reading [this blogpost](https://argilla.io/blog/annotation-guidelines-practices) and the bibliographical references mentioned there.
+
 ## Distribute the task
 
 In the dataset settings page, you can also change the dataset distribution settings. This will help you annotate more efficiently when you're working as part of a team. The default value for the minimum submitted responses is 1, meaning that as soon as a record has 1 submitted response it will be considered complete and count towards the progress in your dataset.
@@ -23,7 +23,13 @@ Sometimes, you want to have more than one submitted response per record, for exa
 >[!TIP]
 >💡 If you are deploying Argilla in a Hugging Face Space, any team members will be able to log in using the Hugging Face OAuth. Otherwise, you may need to create users for them following [this guide](https://docs.argilla.io/latest/how_to_guides/user/).
 
-When you open your dataset, you will realize that the first question is already filled in with some suggested labels. That's because in the previous section we mapped our question called `label` to the `label_text` column in the dataset, so that we simply need to review and correct the already existing labels. For the token classification, we'll need to add all labels manually, as we didn't include any suggestions.
+When you open your dataset, you will realize that the first question is already filled in with some suggested labels. That's because in the previous section we mapped our question called `label` to the `label_text` column in the dataset, so that we simply need to review and correct the already existing labels:
+
+Screenshot of the dataset in Argilla.
+
+For the token classification, we'll need to add all labels manually, as we didn't include any suggestions. This is how it might look after the span annotations:
+
+Screenshot of the dataset in Argilla with spans annotated.
 
 As you move through the different records, there are different actions you can take:
 - submit your responses, once you're done with the record.
@@ -31,10 +37,3 @@ As you move through the different records, there are different actions you can t
 - discard them, if the record shouldn't be part of the dataset or you won't give responses to it.
 
 In the next section, you will learn how you can export and use those annotations.
-
----
-Examples of images from other chapters:
-
-One-hot encoded labels for question answering.
-
-
\ No newline at end of file
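The task distribution setting described in the chapter10/4.mdx changes above can also be configured from the Python SDK when a dataset is created, rather than through the UI. The following is a minimal sketch assuming the Argilla 2.x client; the API URL, API key, dataset name, field, question and labels are illustrative placeholders rather than values taken from this patch:

```python
import argilla as rg

# Connect to a running Argilla instance (placeholder credentials).
client = rg.Argilla(api_url="https://my-argilla-space.hf.space", api_key="my-api-key")

# Require two submitted responses before a record is considered complete
# (the UI default discussed above is 1).
settings = rg.Settings(
    guidelines="Classify the news articles into one of the four categories.",
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="label", labels=["World", "Sports", "Business", "Sci/Tech"])],
    distribution=rg.TaskDistribution(min_submitted=2),
)

dataset = rg.Dataset(name="ag_news_annotation", settings=settings, client=client)
dataset.create()
```

With `min_submitted=2`, a record only counts towards the dataset progress once two different team members have submitted a response to it.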
diff --git a/chapters/en/chapter10/5.mdx b/chapters/en/chapter10/5.mdx
index f18f6605e..d889d51a7 100644
--- a/chapters/en/chapter10/5.mdx
+++ b/chapters/en/chapter10/5.mdx
@@ -37,8 +37,7 @@ filtered_records = dataset.records(status_filter)
 ```
 
 >[!TIP]
->⚠️ Note that the records could have more than one response and that each of them can have any status from `submitted`, `draft` or `discarded`.
-
+>⚠️ Note that the records with `completed` status (i.e., records that meet the minimum submitted responses configured in the task distribution settings) could have more than one response and that each response can have any status from `submitted`, `draft` or `discarded`.
 
 Learn more about querying and filtering records in the [Argilla docs](https://docs.argilla.io/latest/how_to_guides/query/).
 
diff --git a/chapters/en/chapter10/7.mdx b/chapters/en/chapter10/7.mdx
index c0bf8b3fc..90b923e8b 100644
--- a/chapters/en/chapter10/7.mdx
+++ b/chapters/en/chapter10/7.mdx
@@ -35,6 +35,10 @@ Let's test what you learned in this chapter!
 	{
 		text: "Train your model",
 		explain: "You cannot train a model directly in Argilla, but you can use the data you curate in Argilla to train your own model",
+	},
+	{
+		text: "Generate synthetic datasets",
+		explain: "To generate synthetic datasets, you can use the distilabel package and then use Argilla to review and curate the generated data.",
 	}
 ]}
 />
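The `status_filter` that appears in the chapter10/5.mdx hunk is built with the SDK's query and filter objects before records are exported. A short sketch, again assuming the Argilla 2.x Python SDK and an illustrative dataset name:

```python
import argilla as rg

client = rg.Argilla(api_url="https://my-argilla-space.hf.space", api_key="my-api-key")

# Load the annotated dataset from the server (placeholder name).
dataset = client.datasets(name="ag_news_annotation")

# Keep only records that already have at least one submitted response.
status_filter = rg.Query(filter=rg.Filter(("response.status", "==", "submitted")))
filtered_records = dataset.records(status_filter)

for record in filtered_records:
    print(record.fields["text"])
```

As the updated tip notes, a record that reaches the `completed` status can still hold several responses, and each individual response keeps its own `submitted`, `draft` or `discarded` status.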