Data schema #18

Open · aditya6396 opened this issue on Jan 14, 2025 · 0 comments

@aditya6396:
This is my YAML file:

```yaml
name: "pretraining_data"
parquet_path:
  s3: "wiki_data"
source_column: "text_sentences_sonar_emb"
source_text_column: "text_sentences"
```
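For reference, here is a minimal sketch (my own test code, not from the repo) of writing a parquet file whose columns match this YAML; the embedding dimension of 1024 is an assumption based on SONAR:

```python
# Minimal sketch (not repo code): write a parquet file whose columns match
# the YAML above. "text_sentences" holds the raw sentences per document and
# "text_sentences_sonar_emb" holds one float32 vector per sentence.
import pyarrow as pa
import pyarrow.parquet as pq

DIM = 1024  # assumed SONAR embedding size; adjust to your encoder

table = pa.table({
    "text_sentences": pa.array(
        [["First sentence.", "Second sentence."]],
        type=pa.list_(pa.string()),
    ),
    "text_sentences_sonar_emb": pa.array(
        [[[0.0] * DIM, [0.0] * DIM]],
        type=pa.list_(pa.list_(pa.float32())),
    ),
})
pq.write_table(table, "wiki_data/example.parquet")
```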

After the download, my data is saved in the `wiki_data` folder, which contains the file `0_b0ddbee86cdf7d47_0_0.parquet`.

Then I ran the command you give in the documentation:

```shell
python scripts/fit_embedding_normalizer.py --ds dataset1:4 dataset2:1 dataset3:10 --save_path "path/to/new/normalizer.pt" --max_nb_samples 1000000
```

The schema of `0_b0ddbee86cdf7d47_0_0.parquet` is:

```python
import pyarrow as pa

schema = pa.schema([
    ("id", pa.int64()),
    ("url", pa.string()),
    ("text_sentences_sonar_emb", pa.list_(pa.list_(pa.float32()))),
])
```
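Note that this schema has no `text_sentences` column, even though the YAML's `source_text_column` references one. A minimal sketch (my own, using pyarrow) to check which of the configured columns actually exist in the file:

```python
# Sketch: compare the columns in the downloaded parquet file against the
# column names referenced in the YAML config.
import pyarrow.parquet as pq

parquet_file = "wiki_data/0_b0ddbee86cdf7d47_0_0.parquet"
expected = ["text_sentences_sonar_emb", "text_sentences"]  # names from the YAML

schema = pq.read_schema(parquet_file)
for col in expected:
    print(col, "->", "present" if col in schema.names else "MISSING")
```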

To build the normalizer.pt file according to my YAML file, I ran:

```shell
uv run python scripts/fit_embedding_normalizer.py --ds pretraining_data:1 --save_path "/home/cpatwadityasharma/lcm/large_concept_model/output/normalizer.pt" --max_nb_samples 1000000
```

This fails with the error I mentioned above.
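Once the script completes without errors, a quick sanity check (a sketch, assuming the script saves a standard torch checkpoint) would be:

```python
# Sketch: confirm the normalizer file was written and inspect what it holds.
# Recent torch versions may require weights_only=False for pickled objects.
import torch

state = torch.load(
    "/home/cpatwadityasharma/lcm/large_concept_model/output/normalizer.pt",
    map_location="cpu",
)
print(type(state))
```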

I also followed these steps for the training config:

```yaml
training_data:
  - name: "pretraining_data"
    source_suffix_text: "End of text."

validation_data:
  - name: "some_other_separate_validation_data"
    source_suffix_text: "End of text."
```
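To catch YAML mistakes (like the missing closing quote in my first config above), here is a minimal sketch (my own, assuming the config is saved under a hypothetical filename `my_config.yaml` with the list layout shown above) that loads and prints the config:

```python
# Sketch: load the training config with PyYAML and print the dataset entries,
# to surface indentation/quoting mistakes before training.
import yaml

with open("my_config.yaml") as f:  # hypothetical filename
    cfg = yaml.safe_load(f)

for split in ("training_data", "validation_data"):
    for ds in cfg.get(split, []):
        print(split, ds["name"], ds.get("source_suffix_text"))
```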
Please provide the appropriate solution for base LCM training and explain how to do it. Thank you.
