Data schema #18

Open · aditya6396 opened this issue on Jan 14, 2025 · 0 comments

@aditya6396:
This is my YAML file:

```yaml
name: "pretraining_data"
parquet_path:
  s3: "wiki_data"
source_column: "text_sentences_sonar_emb"
source_text_column: "text_sentences"
```
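For reference, here is a minimal sketch (my own test code, not from the repo) of writing a parquet file whose columns match this YAML; the embedding dimension of 1024 is an assumption based on SONAR:

```python
# Minimal sketch (not repo code): write a parquet file whose columns match
# the YAML above. "text_sentences" holds the raw sentences per document and
# "text_sentences_sonar_emb" holds one float32 vector per sentence.
import pyarrow as pa
import pyarrow.parquet as pq

DIM = 1024  # assumed SONAR embedding size; adjust to your encoder

table = pa.table({
    "text_sentences": pa.array(
        [["First sentence.", "Second sentence."]],
        type=pa.list_(pa.string()),
    ),
    "text_sentences_sonar_emb": pa.array(
        [[[0.0] * DIM, [0.0] * DIM]],
        type=pa.list_(pa.list_(pa.float32())),
    ),
})
pq.write_table(table, "wiki_data/example.parquet")
```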

After the download, my data is saved in the `wiki_data` folder, which contains the file `0_b0ddbee86cdf7d47_0_0.parquet`.

Then I ran the command you give in the documentation:

```shell
python scripts/fit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path "path/to/new/normalizer.pt" --max_nb_samples 1000000
```

The schema of `0_b0ddbee86cdf7d47_0_0.parquet` is:

```python
import pyarrow as pa

schema = pa.schema([
    ("id", pa.int64()),
    ("url", pa.string()),
    ("text_sentences_sonar_emb", pa.list_(pa.list_(pa.float32()))),
])
```
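Note that this schema has no `text_sentences` column, even though the YAML's `source_text_column` references one. A minimal sketch (my own, using pyarrow) to check which of the configured columns actually exist in the file:

```python
# Sketch: compare the columns in the downloaded parquet file against the
# column names referenced in the YAML config.
import pyarrow.parquet as pq

parquet_file = "wiki_data/0_b0ddbee86cdf7d47_0_0.parquet"
expected = ["text_sentences_sonar_emb", "text_sentences"]  # names from the YAML

schema = pq.read_schema(parquet_file)
for col in expected:
    print(col, "->", "present" if col in schema.names else "MISSING")
```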

To build the normalizer.pt file according to my YAML file, I ran:

```shell
uv run python scripts/fit_embedding_normalizer.py --ds pretraining_data:1 --save_path "/home/cpatwadityasharma/lcm/large_concept_model/output/normalizer.pt" --max_nb_samples 1000000
```

This fails with the error I mentioned above.
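Once the script completes without errors, a quick sanity check (a sketch, assuming the script saves a standard torch checkpoint) would be:

```python
# Sketch: confirm the normalizer file was written and inspect what it holds.
# Recent torch versions may require weights_only=False for pickled objects.
import torch

state = torch.load(
    "/home/cpatwadityasharma/lcm/large_concept_model/output/normalizer.pt",
    map_location="cpu",
)
print(type(state))
```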

I also followed these steps for the training config:

```yaml
training_data:
  - name: "pretraining_data"
    source_suffix_text: "End of text."

validation_data:
  - name: "some_other_separate_validation_data"
    source_suffix_text: "End of text."
```
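To catch YAML mistakes (like the missing closing quote in my first config above), here is a minimal sketch (my own, assuming the config is saved under a hypothetical filename `my_config.yaml` with the list layout shown above) that loads and prints the config:

```python
# Sketch: load the training config with PyYAML and print the dataset entries,
# to surface indentation/quoting mistakes before training.
import yaml

with open("my_config.yaml") as f:  # hypothetical filename
    cfg = yaml.safe_load(f)

for split in ("training_data", "validation_data"):
    for ds in cfg.get(split, []):
        print(split, ds["name"], ds.get("source_suffix_text"))
```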
Please provide the appropriate solution for base LCM training and explain how to do it. Thank you.
