diff --git a/examples/CCI3-HQ/README.md b/examples/CCI3-HQ/README.md
index a512824f..a47fc832 100644
--- a/examples/CCI3-HQ/README.md
+++ b/examples/CCI3-HQ/README.md
@@ -8,12 +8,10 @@ To improve the quality of Chinese corpora, we followed [Fineweb-edu's](https://h
 ## Annotation
 
 We used Qwen2-72B-Instruct to score 145,000 pairs of web samples and their scores from 0 to 5, generated by Qwen2. The samples were annotated based on their educational quality with 0 being not educational and 5 being highly educational.
-The prompt used for annotation mostly reuses [FineWeb-edu prompt](./prompt.txt).
-
+The prompt used for annotation mostly reuses the [FineWeb-edu prompt](./prompt.txt). You can use [qwen2_api](./qwen2_api.py) to send annotation requests to an already deployed API (for example, one served with vLLM).
 
 ## Classifier training
 
-The classifier was trained on We added a classification head with a single regression output to [BGE-M3](https://huggingface.co/BAAI/bge-m3) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3.
-
+We added a classification head with a single regression output to [BGE-M3](https://huggingface.co/BAAI/bge-m3) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head, and dropout was not used. When converted to a binary classifier using a score threshold of 3, the model achieved an F1 score of 73%. A [training script](./run_classification_trainval.sh) is provided here.
 
 The classifier is available at: https://huggingface.co/BAAI/cci3-hq-classifier
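
For reviewers, the annotation flow the `+` lines describe (querying a deployed OpenAI-compatible endpoint, such as one served by vLLM, and parsing an educational score out of the completion) can be sketched as below. This is a minimal illustration, not the contents of `qwen2_api.py`: the endpoint URL, model name, and the assumption that the completion ends with the numeric score are all hypothetical.

```python
import json
import re
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed vLLM endpoint
MODEL = "Qwen2-72B-Instruct"  # assumed served model name


def build_request(prompt_template: str, sample: str) -> bytes:
    """Build an OpenAI-compatible chat-completion payload for one web sample."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt_template.format(sample)}],
        "temperature": 0.0,  # deterministic scoring
    }
    return json.dumps(payload).encode("utf-8")


def extract_score(completion_text: str) -> int:
    """Parse a trailing integer score from the completion and clamp it to 0-5."""
    match = re.search(r"(\d+)\s*$", completion_text.strip())
    if match is None:
        raise ValueError("no score found in model output")
    return max(0, min(5, int(match.group(1))))


def score_sample(prompt_template: str, sample: str) -> int:
    """Send one sample to the endpoint and return its parsed 0-5 score."""
    req = urllib.request.Request(
        API_URL,
        data=build_request(prompt_template, sample),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return extract_score(body["choices"][0]["message"]["content"])
```

In a batch annotation run, `score_sample` would be called once per document and the clamped score written alongside the text for classifier training.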