52 GB scan arrays create NUMA issues #25

Closed
bennahugo opened this issue Jun 24, 2019 · 13 comments
@bennahugo
Collaborator

@sjperkins master now goes nowhere fast.

tricolour - 2019-06-21 16:27:18,449 INFO - Searching /home/hugo/.tricolour
tricolour - 2019-06-21 16:27:18,449 INFO - Found static mask file /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask
tricolour - 2019-06-21 16:27:18,451 INFO - Loaded mask /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask (non-dilated) with 41.50% flagged bandwidth between 0.856 and 1.712 GHz
tricolour - 2019-06-21 16:27:22,960 INFO - Adding field 'J0825-5010' scan 1 to compute graph for processing
/home/hugo/tricolour/tricolour/packing.py:212: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=np.bool)
/home/hugo/tricolour/tricolour/packing.py:220: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=data.dtype)
/home/hugo/tricolour/tricolour/packing.py:227: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flags.dtype)
/home/hugo/tricolour/tricolour/packing.py:271: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flag_windows.dtype)
tricolour - 2019-06-21 16:27:24,200 INFO - Adding field 'J0825-5010' scan 2 to compute graph for processing
tricolour - 2019-06-21 16:27:25,457 INFO - Adding field 'J0825-5010' scan 3 to compute graph for processing
tricolour - 2019-06-21 16:27:26,689 INFO - Adding field 'J0825-5010' scan 4 to compute graph for processing
tricolour - 2019-06-21 16:27:27,939 INFO - Adding field 'J0825-5010' scan 5 to compute graph for processing
tricolour - 2019-06-21 16:27:29,123 INFO - Adding field 'J0825-5010' scan 6 to compute graph for processing
tricolour - 2019-06-21 16:27:30,307 INFO - Adding field 'J0825-5010' scan 7 to compute graph for processing
tricolour - 2019-06-21 16:27:31,527 INFO - Adding field 'J0825-5010' scan 8 to compute graph for processing
tricolour - 2019-06-21 16:27:32,761 INFO - Adding field 'J0825-5010' scan 9 to compute graph for processing
tricolour - 2019-06-21 16:27:34,009 INFO - Adding field 'J0825-5010' scan 10 to compute graph for processing
/home/hugo/J0825/venv/local/lib/python2.7/site-packages/dask/blockwise.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  contains = index in indices
[##############                          ] | 37% Completed | 65hr 43min 58.6s
@sjperkins
Member

On a single small 32k scan (32k1934.ms.2min.ms) the difference is not large (this is on stevie, whose disk may be contested).

xarray-ms master:

tricolour - 2019-06-24 11:13:47,654 INFO - Data flagged successfully in 00h07m37s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:25:57,847 INFO - Data flagged successfully in 00h05m51s

I'll see what happens on a larger dataset.

@sjperkins
Member

sjperkins commented Jun 24, 2019

On my laptop with a 3 scan 4K dataset (1527016443_sdp_l0.full_1284.hh_vv.ms) the difference is negligible:

xarray-ms master:

tricolour - 2019-06-24 10:51:58,933 INFO - Data flagged successfully in 00h10m33s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:51:31,495 INFO - Data flagged successfully in 00h10m31s

@sjperkins
Member

Working hypothesis: 52 GB scan arrays are being created which are too big to fit on a single DIMM, leading to unnecessary inter-CPU communication.

Solution: Allocate per-baseline arrays.
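
A minimal sketch of the idea (the dimensions below are illustrative assumptions, not tricolour's actual chunking): a single complex64 scan window at these sizes is ~52 GB, while per-baseline windows are only a few tens of MB each and can be first-touched by the thread that fills them:

import numpy as np

# Illustrative MeerKAT-like dimensions (assumed, not tricolour's actual shapes)
ntime, nbl, nchan, ncorr = 200, 2016, 4096, 4

# Monolithic scan window: one ~52 GB complex64 allocation that can span
# NUMA nodes (commented out; it needs ~52 GB of RAM)
# scan = np.empty((ntime, nbl, nchan, ncorr), dtype=np.complex64)

# Per-baseline windows: each is ~26 MB, small enough that the first touch
# by a worker thread keeps its pages local to that thread's NUMA node
per_bl = [np.empty((ntime, nchan, ncorr), dtype=np.complex64)
          for _ in range(nbl)]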

@sjperkins sjperkins changed the title new xarray-ms results in graphs that is significantly slower 52 GB Arrays create NUMA issues Jun 24, 2019
@sjperkins sjperkins changed the title 52 GB Arrays create NUMA issues 52 GB scan arrays create NUMA issues Jun 24, 2019
@sjperkins
Member

Spot the non-linearity:

# 7.45 GB
In [11]: %timeit -n 1 -r 1 np.ones(int(1e9), dtype=np.complex64)
1 loop, best of 1: 2.1 s per loop

# 14.9 GB
In [12]: %timeit -n 1 -r 1 np.ones(int(2e9), dtype=np.complex64)
1 loop, best of 1: 4.22 s per loop

# 29.8 GB
In [13]: %timeit -n 1 -r 1 np.ones(int(4e9), dtype=np.complex64)
1 loop, best of 1: 2min 42s per loop

Things seem to get much slower above the 16 GB power-of-two boundary.

# 16.3GB
In [23]: %timeit -n 1 -r 1 np.ones(int(2.2e9), dtype=np.complex64)
1 loop, best of 1: 49.5 s per loop
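
For reproduction, a small sweep along the same sizes (a sketch; the largest case assumes ~30 GB of free RAM):

import time
import numpy as np

# Time np.ones at increasing sizes to locate the knee reported above
for n in (int(1e9), int(2e9), int(2.2e9), int(4e9)):
    start = time.time()
    a = np.ones(n, dtype=np.complex64)
    elapsed = time.time() - start
    print("%.1e elements (%.1f GiB): %.1f s" % (n, a.nbytes / 2.0**30, elapsed))
    del a  # release before the next, larger allocation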

@bennahugo
Collaborator Author

bennahugo commented Jun 24, 2019 via email

@sjperkins
Member

> Can't reproduce.

Where can't you reproduce this? The above timings are on stevie, where the problem was reported.

@bennahugo
Collaborator Author

[image attached]

stevie

@bennahugo
Collaborator Author

This is the OS default without NUMA pinning.

@sjperkins
Member

Are you just doing a malloc (np.empty), a zeroed malloc (np.zeros), or an np.ones?

Mallocs and zeroed mallocs are fast; it's filling the array where we see the slowdown.

Anyhow, now that the dead Cubical processes have been cleared out on stevie, things have improved. I suspect that the Cubical shared memory model was holding onto pages and might have provoked swapping.

# 7.45 GB
In [4]: %timeit -r 1 -n 1 np.ones(int(1e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 3.07 s per loop

# 14.9GB
In [5]: %timeit -r 1 -n 1 np.ones(int(2e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 6.24 s per loop

# 29.9 GB
In [6]: %timeit -r 1 -n 1 np.ones(int(4e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 12.2 s per loop

# 60 GB
In [7]: %timeit -r 1 -n 1 np.ones(int(8e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 25.3 s per loop

# 120 GB  non-linear jump here
In [8]: %timeit -r 1 -n 1 np.ones(int(16e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 4min per loop
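
For context, a rough illustration of the np.empty / np.zeros / np.ones distinction asked about above (exact behaviour depends on the allocator and kernel; each array below is ~7.45 GB):

import numpy as np

n = int(1e9)

# np.empty is essentially a malloc: no pages are touched, so it returns
# almost immediately regardless of size
a = np.empty(n, dtype=np.complex64)

# np.zeros is typically calloc-like: the kernel hands back lazily zeroed
# pages, so it is also fast; the cost is deferred to the first write
b = np.zeros(n, dtype=np.complex64)

# np.ones writes every element, faulting in every page; this first touch
# is where the time goes, and it also decides NUMA page placement
c = np.ones(n, dtype=np.complex64)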

@sjperkins
Member

I made a mistake: I allocated with ones and then filled with ones again, so the work was duplicated. Anyway, the allocation and fill is faster now.
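
The corrected measurement would be something along these lines (allocate without touching pages, then write every element once):

import numpy as np

# np.empty does not touch the pages; .fill then writes each element exactly
# once, instead of np.ones(...).fill(...) writing everything twice
a = np.empty(int(1e9), dtype=np.complex64)
a.fill(1 + 0j)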

@bennahugo
Collaborator Author

bennahugo commented Jun 27, 2019 via email

@sjperkins
Member

Currently, only a single thread allocates all memory for a window. It's pretty slow.
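
One possible direction (a sketch under the assumption that the window is built as a dask array; this is not tricolour's current code): chunk the window per baseline so each chunk is allocated and first-touched by whichever worker computes it, rather than by one thread up front.

import dask.array as da
import numpy as np

# Illustrative dimensions (assumed): a ~52 GB window split into ~26 MB chunks
ntime, nbl, nchan, ncorr = 200, 2016, 4096, 4

# One chunk per baseline: no single 52 GB allocation, and each chunk's
# pages are first touched by the worker that fills it
windows = da.zeros((ntime, nbl, nchan, ncorr),
                   chunks=(ntime, 1, nchan, ncorr),
                   dtype=np.complex64)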

@sjperkins sjperkins reopened this Jul 2, 2019
@sjperkins
Member

See further description of related issues in #36
