-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make field more robust with None/nan repeated input #757
Conversation
ccl-core
commented
Oct 22, 2024
•
edited
Loading
edited
- Changes in field (more modular, more robust with none/nan values).
- Add test with csv with nan values.
- Add https://huggingface.co/datasets/teknium/OpenHermes-2.5?row=5 as an example of a dataset with repeated Null values for e2e test.
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -0,0 +1,3 @@ | |||
{"default/custom_instruction": "None", "default/topic": "None", "default/model_name": "None", "default/model": "None", "default/skip_prompt_formatting": false, "default/category": "orca", "default/views": null, "default/language": "None", "default/id": "None", "default/title": "None", "default/idx": "None", "default/hash": "None", "default/avatarUrl": "None", "default/system_prompt": "None", "default/source": "airoboros2.2"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are some of the values "None"
and some of the values null
?
@@ -57,7 +57,7 @@ def _apply_transform_fn(value: Any, transform: Transform, field: Field) -> Any: | |||
return pd.Timestamp(value).strftime(transform.format) | |||
else: | |||
raise ValueError(f"`format` only applies to dates. Got {field.data_type}") | |||
elif transform.separator is not None: | |||
elif transform.separator is not None and not _is_na(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the is_na case be handled at the top of the function?
It seems like it could also be a problem for other transform (regex, json_path, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done.
[{field.id: v} for v in value] if not _is_na(value) else [{field.id: value}] | ||
) | ||
else: | ||
if not _is_na(value) and len(value) != len(result[parent_id]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this function be simplified? Something like:
value = [{field.id: None}] if _is_na(value) else value
existing_value = result.get(parent_id, value)
if not _is_na(value) and len(value) != len(existing_value):
raise ValueError(...)
for i in range(existing_value):
result[parent_id][i][field.id] = None if _is_na(value) else value[i]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified it a bit, hope it goes in the direction you had in mind!
value = ( | ||
[_cast_value(self.node.ctx, v, field.data_type) for v in value] | ||
if not _is_na(value) | ||
else value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this function benefit an early return as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand (we return result, not value) but I changed it to a if..elif..else, it should be more readable now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve to unblock but left a few comments.