52 GB scan arrays create NUMA issues #25

Closed
bennahugo opened this issue Jun 24, 2019 · 13 comments
@bennahugo
Collaborator

@sjperkins master now goes nowhere fast.

tricolour - 2019-06-21 16:27:18,449 INFO - Searching /home/hugo/.tricolour
tricolour - 2019-06-21 16:27:18,449 INFO - Found static mask file /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask
tricolour - 2019-06-21 16:27:18,451 INFO - Loaded mask /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask (non-dilated) with 41.50% flagged bandwidth between 0.856 and 1.712 GHz
tricolour - 2019-06-21 16:27:22,960 INFO - Adding field 'J0825-5010' scan 1 to compute graph for processing
/home/hugo/tricolour/tricolour/packing.py:212: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=np.bool)
/home/hugo/tricolour/tricolour/packing.py:220: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=data.dtype)
/home/hugo/tricolour/tricolour/packing.py:227: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flags.dtype)
/home/hugo/tricolour/tricolour/packing.py:271: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flag_windows.dtype)
tricolour - 2019-06-21 16:27:24,200 INFO - Adding field 'J0825-5010' scan 2 to compute graph for processing
tricolour - 2019-06-21 16:27:25,457 INFO - Adding field 'J0825-5010' scan 3 to compute graph for processing
tricolour - 2019-06-21 16:27:26,689 INFO - Adding field 'J0825-5010' scan 4 to compute graph for processing
tricolour - 2019-06-21 16:27:27,939 INFO - Adding field 'J0825-5010' scan 5 to compute graph for processing
tricolour - 2019-06-21 16:27:29,123 INFO - Adding field 'J0825-5010' scan 6 to compute graph for processing
tricolour - 2019-06-21 16:27:30,307 INFO - Adding field 'J0825-5010' scan 7 to compute graph for processing
tricolour - 2019-06-21 16:27:31,527 INFO - Adding field 'J0825-5010' scan 8 to compute graph for processing
tricolour - 2019-06-21 16:27:32,761 INFO - Adding field 'J0825-5010' scan 9 to compute graph for processing
tricolour - 2019-06-21 16:27:34,009 INFO - Adding field 'J0825-5010' scan 10 to compute graph for processing
/home/hugo/J0825/venv/local/lib/python2.7/site-packages/dask/blockwise.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  contains = index in indices
[##############                          ] | 37% Completed | 65hr 43min 58.6s
@sjperkins
Member

On a single small 32k scan (32k1934.ms.2min.ms) the difference is not large (this is on stevie, whose disk may be contested).

xarray-ms master:

tricolour - 2019-06-24 11:13:47,654 INFO - Data flagged successfully in 00h07m37s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:25:57,847 INFO - Data flagged successfully in 00h05m51s

I'll see what happens on a larger dataset.

@sjperkins
Member

sjperkins commented Jun 24, 2019

On my laptop with a 3 scan 4K dataset (1527016443_sdp_l0.full_1284.hh_vv.ms) the difference is negligible:

xarray-ms master:

tricolour - 2019-06-24 10:51:58,933 INFO - Data flagged successfully in 00h10m33s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:51:31,495 INFO - Data flagged successfully in 00h10m31s

@sjperkins
Member

Working hypothesis: 52 GB scan arrays are being created which are too big to fit on a single DIMM, leading to unnecessary inter-CPU communication.

Solution: Allocate per-baseline arrays.
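
A minimal sketch of the idea (the dimensions below are illustrative assumptions, not tricolour's actual chunking): a single complex64 scan window at these sizes is ~52 GB, while per-baseline windows are only a few tens of MB each and can be first-touched by the thread that fills them:

import numpy as np

# Illustrative MeerKAT-like dimensions (assumed, not tricolour's actual shapes)
ntime, nbl, nchan, ncorr = 200, 2016, 4096, 4

# Monolithic scan window: one ~52 GB complex64 allocation that can span
# NUMA nodes (commented out; it needs ~52 GB of RAM)
# scan = np.empty((ntime, nbl, nchan, ncorr), dtype=np.complex64)

# Per-baseline windows: each is ~26 MB, small enough that the first touch
# by a worker thread keeps its pages local to that thread's NUMA node
per_bl = [np.empty((ntime, nchan, ncorr), dtype=np.complex64)
          for _ in range(nbl)]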

@sjperkins sjperkins changed the title new xarray-ms results in graphs that is significantly slower 52 GB Arrays create NUMA issues Jun 24, 2019
@sjperkins sjperkins changed the title 52 GB Arrays create NUMA issues 52 GB scan arrays create NUMA issues Jun 24, 2019
@sjperkins
Member

Spot the non-linearity:

# 7.45 GB
In [11]: %timeit -n 1 -r 1 np.ones(int(1e9), dtype=np.complex64)
1 loop, best of 1: 2.1 s per loop

# 14.9 GB
In [12]: %timeit -n 1 -r 1 np.ones(int(2e9), dtype=np.complex64)
1 loop, best of 1: 4.22 s per loop

# 29.8 GB
In [13]: %timeit -n 1 -r 1 np.ones(int(4e9), dtype=np.complex64)
1 loop, best of 1: 2min 42s per loop

Things seem to get much slower above the 16 GB power-of-two boundary.

# 16.3GB
In [23]: %timeit -n 1 -r 1 np.ones(int(2.2e9), dtype=np.complex64)
1 loop, best of 1: 49.5 s per loop
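
For reproduction, a small sweep along the same sizes (a sketch; the largest case assumes ~30 GB of free RAM):

import time
import numpy as np

# Time np.ones at increasing sizes to locate the knee reported above
for n in (int(1e9), int(2e9), int(2.2e9), int(4e9)):
    start = time.time()
    a = np.ones(n, dtype=np.complex64)
    elapsed = time.time() - start
    print("%.1e elements (%.1f GiB): %.1f s" % (n, a.nbytes / 2.0**30, elapsed))
    del a  # release before the next, larger allocation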

@bennahugo
Collaborator Author

bennahugo commented Jun 24, 2019 via email

@sjperkins
Member

> Can't reproduce.

Where can't you reproduce this? The above timings are on stevie, where the problem was reported.

@bennahugo
Collaborator Author

[image attached]

stevie

@bennahugo
Collaborator Author

This is the OS default without NUMA pinning.

@sjperkins
Member

Are you just doing a malloc (np.empty), a zeroed malloc (np.zeros), or an np.ones?

Mallocs and zeroed mallocs are fast; it's filling the array where we see the slowdown.

Anyhow, now that the dead Cubical processes have been cleared out on stevie, things have improved. I suspect that the Cubical shared memory model was holding onto pages and might have provoked swapping.

# 7.45 GB
In [4]: %timeit -r 1 -n 1 np.ones(int(1e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 3.07 s per loop

# 14.9GB
In [5]: %timeit -r 1 -n 1 np.ones(int(2e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 6.24 s per loop

# 29.9 GB
In [6]: %timeit -r 1 -n 1 np.ones(int(4e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 12.2 s per loop

# 60 GB
In [7]: %timeit -r 1 -n 1 np.ones(int(8e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 25.3 s per loop

# 120 GB  non-linear jump here
In [8]: %timeit -r 1 -n 1 np.ones(int(16e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 4min per loop
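
For context, a rough illustration of the np.empty / np.zeros / np.ones distinction asked about above (exact behaviour depends on the allocator and kernel; each array below is ~7.45 GB):

import numpy as np

n = int(1e9)

# np.empty is essentially a malloc: no pages are touched, so it returns
# almost immediately regardless of size
a = np.empty(n, dtype=np.complex64)

# np.zeros is typically calloc-like: the kernel hands back lazily zeroed
# pages, so it is also fast; the cost is deferred to the first write
b = np.zeros(n, dtype=np.complex64)

# np.ones writes every element, faulting in every page; this first touch
# is where the time goes, and it also decides NUMA page placement
c = np.ones(n, dtype=np.complex64)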

@sjperkins
Member

I made a mistake: I allocated with ones and then filled with ones again, so the work was duplicated. Anyway, the allocation and fill is faster now.
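
The corrected measurement would be something along these lines (allocate without touching pages, then write every element once):

import numpy as np

# np.empty does not touch the pages; .fill then writes each element exactly
# once, instead of np.ones(...).fill(...) writing everything twice
a = np.empty(int(1e9), dtype=np.complex64)
a.fill(1 + 0j)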

@bennahugo
Collaborator Author

bennahugo commented Jun 27, 2019 via email

@sjperkins
Member

Currently, only a single thread allocates all memory for a window. It's pretty slow.
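
One possible direction (a sketch under the assumption that the window is built as a dask array; this is not tricolour's current code): chunk the window per baseline so each chunk is allocated and first-touched by whichever worker computes it, rather than by one thread up front.

import dask.array as da
import numpy as np

# Illustrative dimensions (assumed): a ~52 GB window split into ~26 MB chunks
ntime, nbl, nchan, ncorr = 200, 2016, 4096, 4

# One chunk per baseline: no single 52 GB allocation, and each chunk's
# pages are first touched by the worker that fills it
windows = da.zeros((ntime, nbl, nchan, ncorr),
                   chunks=(ntime, 1, nchan, ncorr),
                   dtype=np.complex64)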

@sjperkins sjperkins reopened this Jul 2, 2019
@sjperkins
Member

See further description of related issues in #36
