LCM_MSE eval fails with cnn_dailymail prepared parquet due to missing keys #19
Updated run command with dataset.source_column and dataset.target_column yields further errors:
@jamesdhope: There is a patch #21 (we haven't merged it yet due to some bugs in the third-party lib stopes that failed the CI). Do you want to give this a try?
@antoine-tran Reproduced the same issue with the patch from #21 merged locally; re-running prepare, embed and eval. Eval fails with the error below. This appears to be a separate issue affecting LCM eval.
Please note the example run command for the two-tower LCM does not include the source_column and target_column flags; however, the script fails earlier if these are not supplied for the base LCM. Also, the original issue may have been resolved by the dataset flags, although I can't be sure. The parquet columns are:
Hi all! Could you please elaborate on how you acquired the model checkpoints used for evaluation? Did you conduct the entire training process independently, or did Meta provide pre-trained checkpoints for public access? Thank you in advance!
@YujiaHu0819 I completed the pre-training step with the Wikipedia data as per the readme.md instructions to obtain the model checkpoints. Please note that I did not complete the fine-tuning step.
Hey, I also encounter this error in the LCM evaluation part. My training code is:

```shell
!CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc-per-node=1 -m large_concept_model.lcm.train \
    launcher=standalone +pretrain=mse \
    ++trainer.data_loading_config.max_tokens=512 \
    ++trainer.output_dir="/content/drive/MyDrive/LCM/checkpoints/mse_lcm" \
    +trainer.use_submitit=false
```

Then I downloaded the LCM evaluation parquet file and ran:

```shell
# eval for LCM
!uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
    --predictor base_lcm \
    --model_card /content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model_card.yaml \
    --generator_batch_size 16 \
    --tasks lcm_generation \
    --task_args '{"max_gen_len": 200}' \
    --dataset.parquet_path /content/parquet_dataset/cnn_dailymail/lcm_eval.parquet \
    --data_loading.batch_size 16 \
    --dump_dir content/output_results_lcm
```

and I get the following error:

```
[2025-01-18 21:31:16,942] [rank 0] [INFO] Selected task execution: ['lcm_generation']
```

My LCM training outputs are attached in the logs.txt file; you can see the training details there. I trained MSE_LCM. Also, my datacards.yaml file is below:

```yaml
# FIXME
name: "pretraining_data_train"
parquet_path:
  s3: /content/large_concept_model/sample_data/train_data.parquet
source_column: "text_sentences_sonar_emb"
source_text_column: "text_sentences"
---
# FIXME
name: "pretraining_data_val"
parquet_path:
  s3: /content/large_concept_model/sample_data/val_data.parquet
source_column: "text_sentences_sonar_emb"
source_text_column: "text_sentences"
---
# FIXME
name: "finetuning_data"
parquet_path:
  s3: "cosmopedia_sample"
source_column: prompt_sentences_sonar_emb
source_text_column: prompt_sentences
target_column: text_sentences_sonar_emb
target_text_column: text_sentences
# partition columns:
# "split" (train, validation)
```
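Since missing column keys are the recurring failure in this thread, a tiny stdlib-only check can confirm that every card in a multi-document datacards.yaml declares the fields the loaders look up. This is my own sketch, not part of the repo: the required-field list is an assumption based on this thread, and real code should use a YAML parser instead of the naive regex below.

```python
# Sketch (stdlib only; required fields are my assumption from this thread):
# naive check that every datacard declares its column fields.
import re

DATACARDS = """\
name: "pretraining_data_train"
source_column: "text_sentences_sonar_emb"
source_text_column: "text_sentences"
---
name: "finetuning_data"
source_column: prompt_sentences_sonar_emb
source_text_column: prompt_sentences
target_column: text_sentences_sonar_emb
target_text_column: text_sentences
"""


def parse_cards(text):
    # Split on document separators and grab top-level "key: value" pairs.
    cards = []
    for doc in text.split("---"):
        card = {}
        for line in doc.splitlines():
            m = re.match(r'^(\w+):\s*"?([^"\n]+?)"?\s*$', line)
            if m:
                card[m.group(1)] = m.group(2)
        if card:
            cards.append(card)
    return cards


def missing_fields(card, required=("name", "source_column", "source_text_column")):
    return [field for field in required if field not in card]


for card in parse_cards(DATACARDS):
    print(card["name"], "missing:", missing_fields(card))
```

A card with no `source_column` would show up immediately here, instead of surfacing later as a missing-key error inside the eval loop.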
@hasanyazarr The fix for this one is to specify dataset.source_column and dataset.target_column in the run command, which is missing from the example in the readme.md; however, it will likely fail with the error above complaining about a missing key ☝️
@jamesdhope: Checking the code, in your case it should be:
The current documentation wrongly states that we only need one type of column (
That resolved this issue. @antoine-tran @hasanyazarr, do you have a rough order of time for the eval script to run? With a max_gen_len of 10 on an L4 GPU I have no raw results file after 3 hours.
@antoine-tran Do you know a solution for '[rank 0] [WARNING] filtering table whose nb sentences and nb sonar vectors are aligned, keeping 2 rows out of 11490'? The evaluation code you shared works well, but this warning causes only 2 rows out of 11490 to be evaluated. Full output below:

```
[2025-01-23 07:35:57,150] [rank 0] [INFO] submitted single job for lcm_generation_base_lcm_a591ec0874_2025-01-23_07-35-57: DEBUG_138828927617776
```
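For context on that warning (my reading of it, not the repo's actual filter code): the loader appears to keep only rows where the number of text sentences equals the number of SONAR vectors, so any row with mismatched lengths is silently dropped. A minimal illustration:

```python
# Sketch (assumption, not the repo's actual filter): keep only rows whose
# sentence list and SONAR embedding list have the same length.
rows = [
    {"prompt_sentences": ["s1", "s2"], "prompt_sentences_sonar_emb": [[0.1], [0.2]]},  # aligned
    {"prompt_sentences": ["s1", "s2"], "prompt_sentences_sonar_emb": [[0.1]]},         # mismatched, dropped
    {"prompt_sentences": ["s1"], "prompt_sentences_sonar_emb": [[0.3]]},               # aligned
]


def is_aligned(row, text_col="prompt_sentences", emb_col="prompt_sentences_sonar_emb"):
    # One SONAR vector per sentence, otherwise the row is filtered out
    return len(row[text_col]) == len(row[emb_col])


kept = [row for row in rows if is_aligned(row)]
print(f"keeping {len(kept)} rows out of {len(rows)}")  # keeping 2 rows out of 3
```

If this is what is happening, the fix would be in the data preparation step (re-embedding so each sentence gets exactly one SONAR vector), not in eval-time settings such as batch_size.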
@jamesdhope @hasanyazarr We haven't supported vLLM / Triton or other optimized inference libraries yet, and the eval lib relies on submitit to parallelize the jobs. If you specify
That said, the eval lib is not very optimized. Internally, our eval run on cnn_dailymail with 5 GPUs took about 15-20 minutes to finish.
@hasanyazarr: I could not reproduce the issue. Could you try to run the following script (or something similar; I wrote this directly in the comment box without testing) and tell me how much data you get?

```python
import torch
from fairseq2.gang import FakeGang
from lcm.datasets.configs import ParquetDatasetConfig, EvaluationDataLoadingConfig
from lcm.evaluation.utils.data_utils import ParquetTestDataLoader

dataset = ParquetDatasetConfig(
    parquet_path="YOUR parquet file",
    source_column="prompt_sentences_sonar_emb",
    source_text_column="prompt_sentences",
    target_column="answer_sentences_sonar_emb",
    target_text_column="answer_sentences",
)
data_loading = EvaluationDataLoadingConfig(
    batch_size=1, seed=23, min_length_of_sequences=1, nb_epochs=1
)
data_loader = ParquetTestDataLoader(
    data_config=data_loading,
    datasets=[dataset],
    gang=FakeGang(device=torch.device("cuda:0")),
    dtype=torch.float32,
)

cnt = 0
for batch in data_loader.iterate_batches():
    cnt += len(batch)
```
@antoine-tran The exact code you provided produces an error, but after fixing it, I get cnt = 11490.
OK, so the dataloading should be good. What is your actual command for evaluation again?
```shell
!CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
    --predictor base_lcm \
    --model_card /content/drive/MyDrive/LCM/checkpoints/mse_lcm/checkpoints/step_10000/model_card.yaml \
    --launcher standalone \
    --dataset.parquet_path /content/drive/MyDrive/LCM/eval_data/0_55ac997a0bfaa427_0_0.parquet \
    --dataset.source_column prompt_sentences_sonar_emb \
    --dataset.source_text_column prompt_sentences \
    --dataset.target_column answer_sentences_sonar_emb \
    --dataset.target_text_column prompt_sentences \
    --tasks lcm_generation \
    --task_args '{"max_gen_len": 200}' \
    --data_loading.batch_size 16 --generator_batch_size 16 \
    --dump_dir /content/drive/MyDrive/LCM/output_results_lcm
```

This command generates that error. I also tried different batch_size values, but it always uses 2 rows out of 11490. In my model_card path I have the files rank_0.pt, metadata.pt, model_card.yaml and model.pt.
Following evaluation instructions to evaluate the pre-trained LCM_MSE on cnn_dailymail parquet data.
Run command
Error:
Investigation
Adding a print(batch.keys()) to data_utils.py reveals that iterate_batches is looking for a _source_column key.
The cnn_dailymail parquet columns generated by the prepare script are:
and cnn_dailymail.py must also be modified with the following: