Release 0.5: merging in GPU efforts (#261)
* fixes doc
* some changes to import fft
* further step to debug
* import fft now doesn't use cppimport, compiles to a temporary directory rather than the build directory, and can be imported multiple times with different shapes. Needs a decent tidy up though.
* initial refactor. Now subclasses the UnixCCompiler like a real person
* Removes the find module - not needed anymore
* clean up imports
* Moving more to gpu - broken
* All elements in place, needs testing
* smaller fixes
* get rid of unnecessary printing
* Executes now with segfault when saving, can't confirm reco
* save final result, even if autosave is false (like current master)
* Fixing travis build, ignoring all acceleration tests for now
* refactor so that the temporary directories are cleaned up
* moved tag for cuda tests to the correct place
* Bugfix: reduce epsilon in fourier update to avoid overflow issues
* remove traces of debugging
* Added memory to ML_pycuda but it's not used yet. Smaller fixes: LL and photon error now follow the original more closely. Found wrong calculation of A1, A2 in the poly line coefficients
* Added regularizer to ML_pycuda with unit tests
* pycuda floating intensities first pass
* NaNs are gone, executes but differs from ML and ML_serial
* Float intens test part one, renaming
* Fixed a weird issue probably resulting from races when writing out the floating intensity coefficients. Floating intensity switch now functional
* renamed engine
* Fixed accelerate tests
* using older version of pyopencl to fix travis build
* Restore normal Storage.data at the end, since its context has been changed in iterate
* removed print statements
* made support constraint work by copying data to CPU and back
* Added pycuda tests for gaussian filter, expected to fail
* new DM/ML pycuda template
* Added complex convolution kernel, tests passing
* tidying up
* object smoothing in DM_serial, DM_pycuda and DM_pycuda_streams
* fixed smoothing preconditioner in ML_pycuda
* cast to int to avoid np.isscalar warning
* cast to int to avoid np.isscalar warnings
* include archflag in linker command to avoid warnings
* Create new context in engine_initialize; this allows running multiple pycuda engines back-to-back
* Cleaning up and preparing for log-likelihood
* fix name of engine in diamond benchmarks
* Added array-based kernel for log likelihood and tested against regular ptypy LL
* log-likelihood error in DM_serial
* prepare for log-likelihood kernel, test is failing
* new kernel for log_likelihood, improved tests, seems to work
* clean up
* more cleanup
* Implemented exit error in DM_serial, including unit test
* Implemented exit_error for DM_pycuda doing the reduction on the CPU, works but is slow
* Move tests for exit_error to Fourier kernel
* Full implementation of exit error on the GPU, works in tests and example
* simplified log-likelihood reduction
* added todo
* added new line
* Removed debugging traces
* fixed tests
* Fixed bug in DM_pycuda
* Added error metrics to DM_pycuda_streams
* Bugfix in pycuda_streams
* device memory pool is causing problems in tests
* ML pycuda is broken -> fixed (#269)
* queue synchronisation needed
* Added option to disable DMPs
* Benchmark test scripts for ML
* bugfix: forgot to create placeholder for GPU memory
* fix in position refinement kernel
* [WIP] Enable nearfield propagation for pycuda engines (#275)
* Enable nearfield propagation, first pass
* Farfield works, but still problems with self.fw and self.bw in nearfield propagation
* New template for testing nearfield with DM pycuda
* Removed lambdas for nf prop, still not tested
* Using correct FFTs for nf prop
* Need a third fft for nf prop
* removed unnecessary code
* removed unnecessary filter
* Added acceleration tests for propagation kernel, nf prop still broken
* modified nearfield templates
* Make sure nf prop kernels have correct dtype
* cosmetic changes and small bug-fix in prop kernel
* Changed template size back to 1024. Co-authored-by: Julio Cesar DA SILVA <[email protected]> Co-authored-by: Benedikt Daurer <[email protected]>
* create rank_local for non-MPI start; tackles one issue in #255
* [bugfix] Gaussian kernel now works with non-square images (#276)
* Allow for post-iterate modifications in ML_serial
* Bugfix in gaussian filter, now works with non-square images
* bugfix: update fft3 queue only in nearfield case
* apply probe support for non-MPI case, this closes #277
* Introduce fourier_power_bound parameter (#279)
* Introduce fourier_relax_normalization parameter
* Introduce explicit parameter for power bound
* Include theoretical power bound value for Poisson data in doc (see the note after this list)
* Fix Gaussian filtering kernel (#281)
* make Gaussian filter work for low sigma values
* Remove scipy dependency
* Make copy instead of changing kernel syntax
* Take abs of data (same as DM engine)
* Revived old streaming engine
* Bug fix to save new pos for new DM engines (#284)
* initial fix to save new pos for new DM engines
* Finished fix to save positions
* clean up
* Use consistent variable names
* Bugfix: only convert to new coords when posref used
* update test workflow
* Improve data loading scaling with decreasing frames_per_block (#286)
* move reformat/initialize out of scan model
* cleaning up
* move tests to root level
* fix imports
* Added new DM_local engine, currently does nothing
* started working on iterator
* DM_local works with alpha=0
* DM_local work in progress
* same power bound for all scans
* Use shuffled vieworder
* WIP: Reorganizing GPU efforts (#291)
* Reorganizing GPU efforts
* Fixed some imports
* Tests passing. Accelerated engines all broken.
* More import statement fixes
* Further import fixes. Reikna is unhappy: Reikna tests complain about an "invalid resource handle". Needs further investigation
* Pycuda engines fail silently, again.
* Separated useful stuff from array_based into base
* Accelerate base tests are working. Adding explicit imports in templates
* Amended test workflow
* Update test.yml
* fixed imports
* fixed imports in accelerate tests
* more import fixes
* More import fixes
* Save out arrays for debugging
* Added DLS tests based on real data
* most of dls_tests are working now
* improve dls_tests
* DM_pycuda_stream fully functional plus extras (#292)
* Reviving pycuda stream engine part 1
* Testing new mem manager
* First pass on DM_pycuda_stream with GPUData
* Ready for testing
* Still testing
* Three-stream flow with per-data event.
* Added position refinement
* Made DM_pycuda and DM_pycuda_stream compatible with staggered data
* Removed block limit, Datamanager now grows dynamically
* Need to copy back pagelocked memory, some cleanup
* Exposed FFT choice to users. Disentangled the 2 cuda-based FFTs. Co-authored-by: Benedikt Daurer <[email protected]>
* DLS real data tests now working, not all are passing
* Test with atomics for now
* improved dls_tests
* small change to dls_tests
* only read regul data for regularization tests
* Testing make_a012: still failing
* Fixed import
* Testing probe/object update without atomics
* Gpu flexible datatypes (#294)
* generalised and flexible data types for fill_b kernels
* configurable data types for batched_multiply
* build_aux kernels and variants with flexible dtypes
* flexible data type for build_exit
* flexible data types for error_reduce
* finite difference kernel update for consistent and flexible data types
* consistent naming of data types in dot.cu
* flexible types in exit_error.cu
* fmag_all_update kernel with flexible datatypes
* adjustable data types in fourier_error.cu
* configurable data types for full_reduce
* gd_main with flexible data types
* flexible data types in log_likelihood
* flexible data types in intens_renorm
* flexible dtypes in update_addr_error_state
* flexible data types for make_a012
* better error output from kernel compilation by inserting a line directive
* flexible data types for make_model
* flexible data types in ob_update_ML
* flexible data types in ob_update
* flexible data types on ob_update2_ML
* type-generic ob_update2
* flexible data types for pr_update_ML
* flexible data types for the pr_update kernel
* flexible data type for pr_update2_ML
* flexible data types for pr_update2
* flexible data type on transpose
* flexible data type for kernel in convolution
* removing old type substitutions
* fixing explicit type casts
* adding an ACC_TYPE to the tiled update kernels
* adding note to explain the register-spilling effect on the tiled update kernels
* Making ob/pr denominator real, tests passing (#295)
* removed unused code throwing a confusing error
* Gpu precision and bugfixes (#296)
* fixing bug in DLS test, transferring the wrong data to GPU
* investigations / improvements re make_a012 precision errors
* fixing explicit type casts
* adding an ACC_TYPE to the tiled update kernels
* fixing non-atomic ob_update versions for ob dimensions
* fixing gradient descent data type specification in test
* simplifying ob/pr updates by removing denominator type (#297)
* Save Imodel
* Added crop_pad and testing (#300)
* Added crop_pad and testing
* Added GPU tests for crop_pad_simple. Co-authored-by: Benedikt Daurer <[email protected]>
* Gpu hackathon: intensity kernel fix (#301)
* fixing intensity kernel race condition without extra memory
* Save Imodel. Co-authored-by: Benedikt Daurer <[email protected]>
* Gpu hackathon: accuracy scripts (#302)
* accuracy testing script for gradient descent kernels
* forgotten return statements
* fix in results building. Co-authored-by: Benedikt Daurer <[email protected]>
* Gpu hackathon: crop pad (#303)
* adding a 4D test case for crop-pad
* crop/pad GPU tests are passing
* Integrated crop_pad into propagator, simple tests passing
* Added tests for crop/pad, refactored ArrayUtilsKernel.
* adding BDIM_X/BDIM_Y to other uses of the error_reduce kernel
* conditionally enable -std=c++14 flag depending on CUDA version. Co-authored-by: Benedikt Daurer <[email protected]> Co-authored-by: Bjoern Enders <[email protected]>
* consolidate yml files (#307)
* WIP: Local Douglas-Rachford algorithm (#304)
* Renamed engine to Douglas-Rachford (DR) and added citation
* working on tests
* Use build_aux kernel to compute pr*ob product
* Add unit tests for new kernels
* formatting
* Added tests for build_exit_alpha_tau
* adding prototypes to make tests runnable
* fixing reference implementation for alpha_tau test
* implementation of build_exit_alpha_tau on GPU
* Separate the maximum norm from the main update
* Updated tests
* First draft of DR_pycuda engine
* Properly reading back the errors
* kernel + tests for max_abs2
* ob_update_local is working on GPU
* Added debugging output
* simplified base kernel, we should be able to also simplify the CUDA kernel for max_abs2
* pr_update_local working on GPU
* DR pycuda engine is running, but having illegal memory access issues
* Define grid by using addr instead of ex
* simplified / refactored max_abs2 as independent function
* Changed addr in the DR engine
* norm is now of type IN_TYPE
* DR pycuda engine working now
* clean up
* more clean up, made exit_error optional
* added dls_test for update_local
* don't need pbound in DR and can make fourier_error optional
* optimised GPU kernels for DR engine, using a thread block per Y dimension
* max norm for DR engine needs to sum over modes
* ob/pr norm is a single value now
* No need for lists, pycuda can do slicing :)
* no need anymore for lists when copying errors back
* typo
* allow changing block dimensions easily from python
* adjusting in case of BDIM_Y > 1, we need to return early then
* adding fourier_deviation kernel to GPU - tested against fourier_error
* fourier_deviation integrated in DR engine
* fmag_all_update without pbound
* cleaned up and renamed fmag_update_nopbound
* Trying a different strategy for shuffling the vieworder
* adding a build_aux2 with different parallelisation strategy
* build_aux_no_ex with different parallelisation scheme
* integrate new build_aux kernels into DR engine
* load_kernel supports multiple kernels per file + refactor to keep code DRY
* fixing max_abs2 and local updates to aggregate over modes
* better parallelisation on the log_likelihood error
* avoid extra copy on the CPU for D2H transfers
* make vieworder shuffling simpler again
* first attempt at DR streaming engine
* use random shuffle for vieworder in streaming engine
* allow for modes in engine, add templates
* Bring DR engines back in sync
* Test ob/pr update local with modes
* updates to DR engine to fix sizing and transfers
* updating benchmark script to new API
* made DR work with modes
* fixing crash on shutdown due to pagelocked memory
* increased MAX_BLOCKS and clean up
* Fixed typo. Co-authored-by: Jorg Lotze <[email protected]>
* Use blockmodel in DR templates
* added benchmark script that fails with DM_pycuda_stream
* update scan name
* Fixed bug in GpuDataManager2 that would overallocate blocks.
* fixing transpose kernel call, as that moved to its own class
* include scratchmem size in dict key (#308)
* updates to imported FFT to compile all supported sizes into the same module
* integrating filtered_cufft in setup.py
* cleanup and re-organising file locations
* Revert accidental commit: "cleanup and re-organising file locations". This reverts commit 6c4904b.
* Revert accidental commit: "integrating filtered_cufft in setup.py". This reverts commit ce89ee7.
* Revert accidental commit: "updates to imported FFT to compile all supported sizes into the same module". This reverts commit c8b3f7b.
* Fix in context initialisation to raise an exception (#312)
* fix in context initialisation to raise an exception in case more processes than GPUs are created
* More verbose error and allow creating a new stream with an existing context
* improved error message. Co-authored-by: Benedikt Daurer <[email protected]>
* WIP: position correction (#309)
* Introduce grid search option for position refinement
* fixed bug in address mangler
* re-designed position correction base kernel, added grid search
* Make sure we stay within valid bounds
* base address mangler tests
* address mangler's get_address on GPU (tests)
* integrating GPU-based address manglers in DM engines
* Fix typo in DM_serial, clean up debugging traces
* avoid expensive re-allocations for deltas in address manglers
* position grid search seems to work again with all DM engines
* simplified address mangler
* Template scripts for position correction
* use a raw memcopy for the deltas to GPU, which will also work for differing sizes
* Fixing data type and memcopy for the deltas in address manglers
* Remove warning message
* fixing typo for transpose kernel + setting position correction stream
* Implement "photon" metric in all DM engines
* need to synchronize
* starting to add position correction in ML
* Add templates for position refinement
* It does not make sense to implement position correction for ML in this way. Co-authored-by: Jorg Lotze <[email protected]>
* no need to test for ML with position refinement
* archived extensions.py
* Gpu smoothing fix (#314)
* work in progress refactoring of convolution kernel
* tests for gaussian smoothing are passing now
* integrating new smoothing kernels into engines
* create the tmp array if not given
* avoid repeatedly creating tmp array. Co-authored-by: Jorg Lotze <[email protected]>
* Precompile cufft during setup to avoid MPI failures and speed up execution (#313)
* updates to imported FFT to compile all supported sizes into the same module
* integrating filtered_cufft in setup.py
* cleanup and re-organising file locations
* fixing typos
* made cufft extension module optional in setup.py (enabled by default for now)
* replaced cmdline flag with try/except
* moved setupext into accelerate folder
* move extension back to root level, improved build message. Co-authored-by: Benedikt Daurer <[email protected]>
* needed to make changes in position correction tests
* Gpu NCCL wrapper (#310)
* multi-GPU wrapper using NCCL for allReduce
* Implementation and generalisation of the multi-gpu tests, incl. cuda-aware MPI
* adding C++ MPI test for cuda-aware MPI
* multi-gpu support integration in DM_pycuda_stream - work in progress
* clean up and findings for multi-gpu implementation
* probe allreduce and change calc on GPU for all DM engines
* Change smoothing message to level 4
* use multigpu.allReduceSum in all DM engines
* Moved support constraint to GPU for DM engines
* Attempt to write clip_magnitudes kernel, unit test fails
* Integrate clip_magnitudes kernel into DM engines, still off for now
* working on clip magnitudes kernel
* adjusting test to pass complex<float>
* use clip_object
* need to pass gpu array
* adding reproducer script for the NCCL crash in the engines
* adding dummy call to build_aux_no_ex to test
* Fixing NCCL issue - streams allocated before NCCL can't be used afterwards
* move smoothing message to log level 4
* use simpler syntax for DtoD copies
* this avoids a cleanup error when using NCCL
* Use multigpu allreduce for change, clean up
* remove benchmarks from pycuda engines and move most logging to level 4
* cosmetic changes. Co-authored-by: Benedikt Daurer <[email protected]>
* checking at runtime if nccl/cuda-mpi are available
* Fixed bugs in address manglers
* Fixed bug in DM stream engines related to smoothing/object update
* reversing the order of the support constraints (#315)
* reversing the order of the support constraints
* now making the intended change
* Make basic fourier update a true blend between DM and AP (#288)
* Make basic update a true blend between DM and AP
* made all DM updates a true blend of DM and AP.
* Cleaned up debugging traces
* pycuda engines need to be explicitly imported
* remove print statement
* We need a third object copy for smoothing in the stream engines (#321)
* We need a third object copy for smoothing
* switch the temporary buffers obb.gpu and obb.tmp
* small bugfix
* bugfix: rescale size of aux when using MPI (#322)
* bugfix: rescale size of aux when using MPI
* same MPI rescaling of aux for the serial engines
* Update multi_gpu.py (#324). This is a quick fix for when the nccl library in cupy comes back as `_UnavailableModule`. Alternatively this could be caught in the `__init__` of `MultiGpuCommunicatorNccl`, raising a `RuntimeError` there (a sketch of such a runtime check follows this list).
* Add option to choose fft lib in ML_pycuda (#326)
* add padding to the HDF5loader (#330)
* Allow different masks for spectro scans (#333)
* Record new positions only if requested (#328)
* make saving of new positions optional
* saving grids should not be optional
* Make recording of local error map optional (#329)
* Make recording of local error map optional
* set userlevel to 2
* Check if position refinement amplitude is large enough (#331)
* bugfix: correctly resize the aux shape (#332)
* Use correct aux shape in propagation kernel tests
* Update release_notes.md
* gather_dict/bcast_dict can create issues with multinode MPI (#327)
* simplify gather_dict, works with multiple nodes
* Keep previous code for gather_dict
* Simplify bcast_dict, works with multiple nodes
* merge back into single bcast_dict, remove in-place
* Position refinement for ML (#334)
* working on posref for ML
* position refinement works with ML_pycuda
* use asynchronous copies, getting illegal memory access
* moved sqrt calculation after synchronizing event
* Wrong type for `ma`, must be float not bool
* needed to specify out array in cumath.sqrt. Co-authored-by: Bjoern Enders <[email protected]>
* avoid elementwise comparison
* avoid elementwise comparison (after fixing typo)
* only update views for the original container (#337)
* remove automatic import of all experiment classes (#338)
* WIP: Python 3.9 (#339)
* include Python 3.8 and 3.9 in GitHub Actions workflow
* first round of syntax changes after running flake8
* more flake fixes and ignoring some cases
* enable syntax checking in GitHub Actions workflow
* fixed syntax in test.yaml
* use float() instead of np.float(), which is deprecated in numpy 1.20
* specify numpy dtype to avoid deprecation warnings
* more fixes and ignore statements, the code now passes E9,F63,F7,F82 checks
* convert to raw strings to avoid invalid escape sequence warnings
* one more rawstring conversion
* removed deprecated code, updated release notes
* typo
* Allow for non-boolean frame filter in HDF5Loader
* removed dependency
* fixed more invalid escape sequences
* Fixed parsing of framefilter and solved dtype deprecation warnings
* More dtype fixes to remove deprecation warnings
* select outer index for framefilter
* log power bound
* save floating intensities in ML_serial and ML_pycuda (#353)
* Clean exit when data does not fit into device memory (#354)
* DM_pycuda engines don't fully clear memory (#352)
* improving cleanup, still not all memory freed
* Use MPI instead of NCCL by default
* add empty line
* Remove traces of OpenCL engine (#311)
* padding no longer necessary (legacy of opencl version)
* remove unnecessary padding from DR_serial
* correct order fixes problem with setstream tests
* fixed dtype
* convert to int to avoid deprecation warnings
* WIP: include a general form of projection update (#361)
* Refactored Fourier update into general form, RAAR included
* Refactor stupidity
* Refactor stupidity part 2 plus updated docs
* Renamed DR to DM
* Docstrings updated
* Added first unit test for projection_update_generalized and removed a severe bug
* Removed output files for engine tests
* Added general projection engine
* Added general kernels. Modified kernel classes
* Fitted engines to projection form.
* Minor bug fixes, typos. Auxiliary Wave Kernel test run.
* Fixed inheritance
* Updated pycuda_streams
* Fixed RAAR update
* Renamed and moved DM engines
* Minor fixes for tests
* test fix again
* test fix. Co-authored-by: Benedikt Daurer <[email protected]>
* Stochastic Douglas-Rachford algorithm (#359)
* Renamed to SDR and implemented core engine
* Working on local probe/object update
* Generalise stochastic engines
* working on basic stochastic engines
* Refactor of stochastic engines, including serial engines
* Added CUDA kernels for local ob/pr norm, tests passing
* refactor of stochastic PYCUDA engines (EPIE/SDR) - tests passing
* removed debugging traces
* Merged CUDA stream features into stochastic pycuda base engine
* Integrating posref into stochastic engines, still in progress
* introduce decay parameter for posref
* Position refinement works for all stochastic engines
* Use class mixins and combine all stochastic engines in a single file
* Move params and citation to Mixin classes
* fixed tests
* remove duplication in docstrings
* prepare for merge with generic fourier_update
* integration of general fourier update (in progress)
* use generalised fourier update in stochastic engines, introduce rescale parameter
* fixed imports
* include more probe parameters in stochastic engines
* New definition of the generic update
* fixed typo
* Optional global object norm in ePIE (#367)
* adding option to use global object norm for ePIE
* added global object norm option for pycuda engine
* Moved new object_norm_is_global parameter into EPIE engine
* remove files that have been uploaded by accident
* Introducing the idea of customized engines with an object regulariser (#358)
* Introducing the idea of plugins with an object regulariser
* Include shape check for DM plugin
* Provide option to only regularise the phase
* clean up and improve docs
* improve doc
* added test script
* Created new folder mods for customized engines, added start/stop to object regulariser
* Moved back to plugin structure inside the package
* small change to test script
* renamed plugins to custom, prepare for merge
* clean exit that also works with interactive python (e.g. notebooks) (#368)
* move non-standard engines into custom folder (#369)
* moved some of the engines to custom
* remove engines from ptypy/engines
* comment dynamic load
* fixed imports in engine tests
* Make fft chooser more intelligent, update compute_levels (#370)
* Make fft chooser more intelligent, update compute_levels
* fixed fft chooser and completed compute levels
* no need to make variable private
* cleaned up previous commit
* Improve warning message
* roll back to require CUDA >=11.0 for cufft
* cufft compile flags (#372)
* cuda version dependent arch flags
* improve messaging
* raise error if CUDA version is not supported
* use probe_update_start in stochastic serial/pycuda engines
* bugfix: forgot new argument in FFT chooser
* Improve logging (#371)
* elevate citation info to critical
* move timing log from warning to info
* working on new interactive verbose level
* added interactive log messages
* small formatting change
* removed timing from interactive logging
* small fix to interactive logging and example notebooks
* include ipynb checkpoints in gitignore
* improved interactive logging
* use string for logging citation
* clean up
* Updating release notes
* Fix how frames_per_block is defined (#375)
* access the frames per block in the engines via the scan model
* remove debugging
* making sure fpc is per MPI rank
* added convenience loaders for GPU engines and ptyscan modules (#376)
* added convenience loaders for GPU engines and ptyscan modules
* Added tests for auto loaders
* Rebranding projectional pycuda engines (#377)
* register different names, but keep classes the same
* EPIE improvements (#378)
* Added ePIE model with lower memory footprint
* Renamed model to GradFull and BlockGradFull and made sure it works with ePIE and ML
* make variable private
* added more documentation
* moved definition of supported models
* moved supported models into mixin classes
* moved scan model check to initialize
* added todo
* make MPI optional (#379)
* Added wrapper for timing main ptycho functions (#381)
* Wrap main ptycho functions in LogTime
* move benchmarks out of runtime
* add user level to benchmark parameter
* Derive engine name from params not class
* Shift object and exit waves during probe centring (#373)
* Shift object and exit waves during probe centring
* [WIP] mass center cuda
* Finished mass_center for 2D
* Completed the mass_center 3D case
* Loop through different exit wave storages when centering probe
* Fix a bug in the size of threads and blocks in final_sums in mass_center
* Simplify the looping syntax
* Implement center_probe in stochastic algorithm
* Put center_probe in projectional_serial
* Finish the abs2sum kernel
* [WIP] interpolated shift kernel: the starting value of a block for positive shift is wrong, and the last value of a block for negative shift is wrong?
* Fix a bug in linear_interpolate_kernel in interpolated shift: the four corners of the halo were missing, leading to the use of random values in shared memory when performing linear interpolation. The four corners of the halo are now defined correctly.
* Fix another bug in linear_interpolate_kernel (swapped rows and columns): the rows and columns variables were swapped, which may result in a premature return.
* Finish the interpolated shift kernel
* Implement center_probe in projectional_pycuda
* Reduce number of loops when shifting exit waves
* Revert "Reduce number of loops when shifting exit waves". This reverts commit 4e5e22c.
* Adjust the way to access object and gpu data in projectional
* Move the center_probe method out of the inner loop in projectional
* Adjust the way to access object in stochastic
* Move the center_probe method out of the inner loop in stochastic
* Implement the center_probe in stochastic pycuda
* fixed imports for opencl engines
* Pre-determine if data is distributed (#335)
* introduce new scaling flag for Container and pre-determine if data is distributed
* need to check if MPI is enabled
* name change
* made variable "distributed" private and renamed to "_is_scattered"
* Reorganise templates (#384)
* remove unnecessary init files
* move template scripts into subfolders
* updated and tested basic ptypy templates
* need an init for test folder
* updated position refinement templates
* more changes to templates
* cleaned up moonflower engine templates
* more changes to engine templates
* added templates for RAAR engines
* more changes to templates
* fixed farfield example
* moved delayed scripts to live processing folder
* organised model and misc templates, added new notebooks
* benchmark scripts up-to-date
* small fix in diamond benchmarks
* add minimal prep and run script in accelerate templates
* updated dependencies for pycuda engines
* [WIP] Memory management for ML_pycuda engine revised after projectional_pycuda_stream (#382)
* First round of transferring mem management from DM to ML
* Data management replaced except for pos corr update
* fixed import
* Add gpu supp constraint, harmonized engine.finalize
* Added alternative LL calculation for position correction + tests
* Allowed for index reversal after pos corr run
* Wrong log_likelihood in second call in pos corr
* Moved pycuda_streams to archive. Cleaned ML_pycuda and mem_utils
* cleaned up gpudata test
* temporarily disable flake8 linter (causing trouble with python 3.8)
* fix import
* fixed gpudata tests and a small bug in mem utils
* new hdf5 loader using multiprocessing (#380)
* WIP: Unit tests for engines (#389)
* ML engine tests
* debug
* debugging ML precision
* working on engine tests for ML_serial
* debugging
* Cleaning up, some engine tests still failing
* more cleanup and small changes
* more cleanup
* engine tests passing with tol=1e-2
* revert Brenorm changes
* moved cufft extension into separate module (#390)
* moved cufft extension into separate module
* cleaning up
* fixed small bug in log_likelihood.cu
* WIP: Release 0.5 (#393)
* Path to default sphinx layout has changed
* Initial doc push for release
* more updates
* forgot a file
* Removed OrderedDict from h5rw
* Fixed parts of h5info. Fixed guide. Release notes
* Responded to Benedikt's comments
* Sequestered cufft requirement
* more on dependencies
* Cleaning up references
* Fixed stylesheet
* minimal fix
* Added format parameter to save in alternate format
* bugfix
* bugfix bugfix
* kept the record_positions switch
* Root level resources dir gutted, moved to pip install. Co-authored-by: Benedikt Daurer <[email protected]> Co-authored-by: Benedikt J. Daurer <[email protected]>
* archived unused tests, specified valid tests in setup.cfg
* simplified workflow, put back linter
* formatting changes for 0.5 release
* formatting changes to release notes
* adjust length of underlines
* typeDict is a deprecated alias of sctypeDict
* _MODE_CONV removed with PIL 9.1.0 (#399)

Co-authored-by: Aaron Parsons <[email protected]>
Co-authored-by: Benedikt Daurer <[email protected]>
Co-authored-by: Benedikt J. Daurer <[email protected]>
Co-authored-by: Julio Cesar DA SILVA <[email protected]>
Co-authored-by: Jorg Lotze <[email protected]>
Co-authored-by: Timothy Poon <[email protected]>
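Note on the "theoretical power bound value for Poisson data" referenced in the #279 items above. What follows is only a sketch of the standard variance-stabilisation argument, under the assumption that the power bound is read as the expected per-pixel squared deviation between measured and modelled amplitudes; the exact normalisation ptypy uses is documented with the parameter itself and is not reproduced here.

```latex
% For Poisson-distributed counts N with mean \lambda, the delta method gives a
% (nearly) count-independent variance for the measured amplitude \sqrt{N}:
\[
  \operatorname{Var}\bigl(\sqrt{N}\bigr)
  \;\approx\; \left(\frac{1}{2\sqrt{\lambda}}\right)^{2} \operatorname{Var}(N)
  \;=\; \frac{\lambda}{4\lambda}
  \;=\; \frac{1}{4},
\]
% so purely Poisson-limited data is expected to deviate from the model
% amplitudes by about 0.25 per pixel, which motivates a bound of that order.
```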
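Sketch of the runtime NCCL check mentioned in the multi_gpu.py items above (#310, #324, "checking at runtime if nccl/cuda-mpi are available"). This is a generic, hedged illustration rather than ptypy's implementation: the helper `all_reduce_sum` and the flag `HAVE_NCCL` are hypothetical names, and the NCCL code path itself is omitted.

```python
# Minimal sketch: detect at runtime whether CuPy's NCCL bindings are usable
# (they may be missing entirely, or exposed only as an "_UnavailableModule"
# placeholder), and otherwise fall back to an MPI all-reduce.
import numpy as np
from mpi4py import MPI

try:
    from cupy.cuda import nccl  # may be absent or a stub without NcclCommunicator
    HAVE_NCCL = hasattr(nccl, "NcclCommunicator")
except ImportError:
    HAVE_NCCL = False


def all_reduce_sum(arr: np.ndarray) -> np.ndarray:
    """Sum `arr` over all ranks in place (NCCL branch omitted for brevity)."""
    if not HAVE_NCCL:
        MPI.COMM_WORLD.Allreduce(MPI.IN_PLACE, arr, op=MPI.SUM)
    # else: create/reuse an nccl.NcclCommunicator and call its allReduce
    return arr


if __name__ == "__main__":
    data = np.ones(4, dtype=np.float32)
    print(all_reduce_sum(data))  # with 2 MPI ranks: [2. 2. 2. 2.]
```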