Documentation on performance tuning exercise #74
Labels: question (Further information is requested)
The following target-type strategy was applied:
bennahugo added the question label and removed the enhancement label on Mar 28, 2024
Just doing a mind dump of the past few days' work here, for posterity.
I've done careful profiling and tuning for com08 with the following config:
The first 24 and the last 24 logical threads are colocated pairwise, as hyperthreads, on the same physical cores within each NUMA node. I've assigned thread affinities accordingly.
This makes a large difference (~30%) in run times. The dask threadpool is pinned to these logical CPUs, for however large the threadpool becomes (the x-axis of the plots below).
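For reference, this kind of pinning can be done from Python on Linux before the threadpool spins up. A minimal sketch follows, with hypothetical core IDs standing in for the actual com08 topology (the real layout should be read from `lscpu -e` or hwloc):

```python
import os

# Hypothetical topology for illustration only: logical CPUs 0-23 are one
# NUMA node's physical cores, and 48-71 are their hyperthread siblings.
# The real IDs on com08 must come from `lscpu -e` / hwloc, not this sketch.
physical_cores = set(range(0, 24))
ht_siblings = set(range(48, 72))

# Pin the current process; threads created afterwards (e.g. the dask
# threadpool workers) inherit this affinity mask on Linux.
os.sched_setaffinity(0, physical_cores | ht_siblings)

print("affinity:", sorted(os.sched_getaffinity(0)))
```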
The memory layout is as shown. I didn't profile the memory footprint in detail, but it stayed at roughly 1/5th of this size for the most part.
iTLB miss ratios are high, but in the grand scheme of data-TLB accesses the misses are essentially negligible. What matters more is tuning the number of baselines per block to lower the L3 cache miss rate, as discussed with @bmerry; a rough illustration of that trade-off follows.
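As a back-of-envelope illustration of that tuning knob (all numbers below are assumptions for illustration, not measured com08 or tricolour values): the aim is for a block's visibilities and flags to fit within the L3 cache shared by the threads working on it.

```python
# Back-of-envelope working-set estimate per baselines-per-block choice.
# Every constant here is an illustrative assumption, not a measured value.
NCHAN = 4096            # channels per block (assumed)
NTIME = 10              # timestamps per block (assumed)
NCORR = 4               # correlation products (assumed)
VIS_BYTES = 8           # complex64 visibility
FLAG_BYTES = 1          # uint8 flag

L3_BYTES = 32 * 2**20   # e.g. a 32 MiB L3 slice (assumed)

def block_bytes(baselines_per_block):
    cells = baselines_per_block * NTIME * NCHAN * NCORR
    return cells * (VIS_BYTES + FLAG_BYTES)

for nbl in (4, 8, 16, 32):
    ws = block_bytes(nbl)
    verdict = "fits" if ws <= L3_BYTES else "spills"
    print(f"{nbl:3d} baselines/block -> {ws / 2**20:7.1f} MiB ({verdict})")
```

The crossover point in a sweep like this is roughly where one would expect the measured L3 miss rate to start climbing.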
I used 112.61 GiB of data (an 856 MHz band channelized to 208 kHz resolution and dumped at 1 s resolution) to profile the flagger. Using any less actually starts breaking the strong scaling here; I suspect we start running into compiler / MAD-flagger / dask overheads in this regime. For small MSv2 datasets (<< 100 GiB, including metadata) the scaling dramatically falls off a cliff.

Python profiling with pprofile is inconclusive. I suspect the profiler does not correctly account for calls into external non-Python libraries: for instance, I'm really suspicious of the very low (~0.02%) fraction attributed to casacore getcol and putcol calls for sizeable reads of tens of GiB. So I don't think we can trust the call-graph profile output. cProfile does not take threads into account, so it is of limited use, although I know from DDFacet profiling that it does account for C calls correctly.

See below for a much smaller dataset (~60 GB, 1k channels, 8 s dump time). Here we run into the weak-scaling behaviour mentioned above.