Updated and heavily simplified anonymization script #235
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   59.63%   64.57%   +4.94%
==========================================
  Files          29       28       -1
  Lines        3793     3498     -295
==========================================
- Hits         2262     2259       -3
+ Misses       1531     1239     -292

☔ View full report in Codecov by Sentry.
* Preprocess now returns a list of paths to the temporary parquet files it created
* Added type hint

---------

Co-authored-by: middd2 <[email protected]>
@yusufuyanik1 would you mind giving this a review? I'd like to merge it sometime soon.
Looks good, worked with some sample data I created :)
We've had an anonymization script in the tools for a little while, but it was not performant enough on realistic loads, so it was time for an update. There are far fewer configuration options here, but it's much more efficient.
It uses a two-pass approach: we first write all input files out as batched parquet files, and then loop over those parquet files to generate a single output parquet file.
Many thanks to @danielm-dk for helping improve this part.
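
For reviewers who want the shape of the two-pass approach at a glance, here is a minimal sketch. It is not the code in this PR. To stay runnable with only the standard library, it uses JSON Lines files as a stand-in for the batched parquet files, and the names `preprocess`, `combine`, `mask`, `batch_size`, and `columns` are all hypothetical. The SHA-256 masking step is likewise an assumption about what anonymizing a column might look like.

```python
import hashlib
import json
from pathlib import Path


def preprocess(input_files: list[Path], tmp_dir: Path, batch_size: int = 2) -> list[Path]:
    """Pass 1: write the inputs out in batches and return the batch file paths."""
    batch_paths = []
    for start in range(0, len(input_files), batch_size):
        batch = input_files[start:start + batch_size]
        # Stand-in for a temporary parquet file (assumption: JSON Lines instead).
        out = tmp_dir / f"batch_{len(batch_paths)}.jsonl"
        with out.open("w") as dst:
            for src in batch:
                for line in src.read_text().splitlines():
                    if line.strip():
                        dst.write(line + "\n")
        batch_paths.append(out)
    return batch_paths


def mask(value: str) -> str:
    """Hypothetical anonymization: replace a value with a short SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def combine(batch_paths: list[Path], output: Path, columns: list[str]) -> int:
    """Pass 2: stream every batch file into one output file, masking `columns`."""
    rows = 0
    with output.open("w") as dst:
        for path in batch_paths:
            with path.open() as src:
                for line in src:
                    record = json.loads(line)
                    for col in columns:
                        if col in record:
                            record[col] = mask(str(record[col]))
                    dst.write(json.dumps(record) + "\n")
                    rows += 1
    return rows
```

In this sketch only one input file is read into memory at a time in the first pass and one record at a time in the second, which is the property that makes a batched two-pass design attractive for large loads. Having `preprocess` return the list of batch paths lets the second pass run without rescanning the temporary directory.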