
No match for FieldRef.Name(split) in id: string #15

Open
nanxue2023 opened this issue Jan 13, 2025 · 4 comments

Comments

@nanxue2023

Hello everyone, I get the following error when executing:

No match for FieldRef.Name(split) in id: string 
url: string 
title: string
text: string
text_sentences: list<element: string>
text_sentences_sonar_emb: list<element: fixed_size_list<element: float>[1024]>

I used pdb to print the stack trace and found that the program fails at line 1087 of parquet_utils.py: fragments = list(dataset._dataset.get_fragments(filter=dataset._filter_expression))
I can't fix this. Could you help me figure it out?

@hiskuDN

hiskuDN commented Jan 13, 2025

@elbayadm gave a good answer here, maybe it'll help. #9 (comment)

@nanxue2023
Author

@hiskuDN Thanks!!!! #9 (comment) helped me solve the problem, but now I hit a new error: 'pyarrow.lib.ListScalar' object has no attribute 'to'. The error appears at line 191 of dataloader.py: embs = [x.to(self.gang.device).to(dtype) for x in batch[col_name]]
I replaced that line with the following:

batch_py = torch.tensor(batch[col_name].to_pylist())
embs = [x.to(self.gang.device).to(dtype) for x in batch_py]

Another error occurs: expected sequence of length 25 at dim 1 (got 26)
😭

@hiskuDN

hiskuDN commented Jan 14, 2025

What dataset are you using? Are you using PyTorch for data handling at any point in the pipeline?

@nanxue2023
Author

The Wikipedia dataset. I use this code to process the data:

import pandas as pd

dataset_path = "/content/large_concept_model/sample_data/0_a25e918a7789ecfa_0_0.parquet"  # dataset path

df = pd.read_parquet(dataset_path)  # load dataset into a pandas DataFrame

df['split'] = 'train'  # add the missing 'split' column

df.to_parquet(dataset_path)  # write the dataset back to parquet
