Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel renderer.

Enigmatisms/cuda-pt

CUDA-PT


Software Path Tracing renderer implemented in CUDA, from scratch.

Distributed parallel rendering supported, via nanobind and PyTorch DDP.

sports-cars

Malorian-Arms-3516

modern-kitchen

dispersion

Variance Depth BVH cost

Compile & Run

The repo depends on several external libraries, so clone it with --recursive:

git clone https://github.com/Enigmatisms/cuda-pt.git --recursive
Windows

The interactive viewer (./build/xx/cpt) depends on GLEW, which must be installed manually. Without GLEW, only the offline application (./build/xx/pt) is available. This code base originally ran on Linux (tested on Ubuntu 22.04), but I haven't tried that since the day my Ubuntu machine broke down. Currently, I build with MSVC (VS2022) and CMake (3.24+):

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release --parallel 7

The build produces two executables, ./build/xx/cpt.exe and ./build/xx/pt.exe. For example, to run the interactive viewer:

cd build/Release
./cpt.exe ../../scene/xml/vader.xml

Note that on Windows, if you change the code in src after building, you may have to run ./rm_devlink_obj.sh to delete all the .device-link.obj files in the build folder and then rebuild. Strangely, those compiled .obj files are not updated, which can cause linking problems.

Linux

The following dependencies should be satisfied:

sudo apt install libglew-dev libwayland-dev libxkbcommon-dev libxrandr-dev libxinerama-dev libxcursor-dev libxi-dev

Then run the following command with CMake (3.24+) and make:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j7

Test the code

After successfully building the project, you will have three build targets:

  • pt(.exe): Single GPU offline rendering.
  • cpt(.exe): Single GPU online rendering, GUI visualization and parameter tweaking.
  • pyrender: Python-importable module. Use help(PythonRenderer) to see the available functionality.

More info

This repo currently has no plan for OptiX, since the point here is to experience building the wheel myself and making it fast, rather than to implement useful features. Useful features are incorporated in the experimental path tracer AdaPT (though AdaPT is now abandoned). Check my GitHub homepage for more information.

Currently, this repo supports:

  • Megakernel unidirectional path tracing.
  • Wavefront unidirectional path tracing with stream compaction. Currently, WFPT is not as fast as megakernel PT due to the simplicity of the test scenes (and maybe coalesced GMEM access problems, which I am working on).
  • BVH cost visualizer and depth renderer.
  • GPU BVH: A stackless GPU surface area heuristic BVH.
  • CUDA pitched textures for environment maps, normal, roughness, index of refraction and albedo.
  • Online modification of the scene. Check out the video down below.
cuda-pt-compressed.mp4
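
The stream compaction step mentioned in the wavefront item above can be sketched on the host as a scan followed by a scatter. This is only an illustration of the general technique, not this repo's kernel: the function name `compact_active_paths` and the `alive` flag array are hypothetical, and a GPU version would run the scan in parallel (for instance with a library primitive such as `thrust::copy_if`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical input: alive[i] is 1 if path i survived the last bounce, else 0.
// Returns the indices of the surviving paths, packed contiguously in their
// original order, so that the next wavefront stage launches no idle threads.
std::vector<std::uint32_t> compact_active_paths(const std::vector<std::uint8_t>& alive) {
    const std::size_t n = alive.size();

    // Step 1: exclusive prefix sum over the flags. slot[i] is the output
    // position of path i if it is alive. This loop is the serial stand-in
    // for a parallel scan.
    std::vector<std::uint32_t> slot(n);
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(alive[i] <= 1);  // flags must be 0 or 1 for the scan to count paths
        slot[i] = running;
        running += alive[i];
    }

    // Step 2: scatter. Every live path writes its own index into its slot.
    // Each iteration is independent, which maps to one GPU thread per path.
    std::vector<std::uint32_t> packed(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (alive[i]) {
            packed[slot[i]] = static_cast<std::uint32_t>(i);
        }
    }
    return packed;
}
```

Calling it with the flags {1, 0, 0, 1, 1, 0} yields the index list {0, 3, 4}. The scan keeps the original ordering of the live paths, which is a design choice here rather than a requirement of wavefront path tracing.
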
TODO
  • (Recent) An imgui based interactive UI.
  • (Around 2025.01, stay tuned) Benchmarking against AdaPT (a Taichi-lang-based renderer) and OptiX (optional). More profiling, and finally, I think I will write several blog posts on "How to implement an efficient software path tracing renderer with CUDA". The posts will focus on software- and hardware-related analysis-driven optimization, so they will summarize (and teach) some best practices for programming tasks with extremely imbalanced workloads.
Tricks (that will be covered in my upcoming blog posts)

I've tried a handful of tricks. Unfortunately, due to limited time, I haven't documented any of them (including statistical profiling and analysis), and I currently have only a somewhat vague notion of the DOs and DON'Ts. Emmm... I really want to summarize all of them in November, after landing a good job. So wish me good luck.

  • Divergence control part I (loop 'pre-converge')
  • Divergence control part II: megakernel or wavefront?
  • Stream compaction for WFPT. Shader Execution Reordering (SER) on the Ada Lovelace architecture (NVIDIA RTX 40 series GPUs). (More in-depth reading is needed on this topic, since NVIDIA said almost nothing important in their SER white paper.)
  • Coalesced access: SoA in WFPT and lg-throttle problem for AoS
  • Local memory: dynamic indexing considered harmful
  • Dynamic polymorphism: GPU-based variant or device-side inheritance (virtual functions and their pointers)?
  • Avoiding bank conflicts and using vectorized loads / stores
  • IMC (constant cache miss): when should you use constant cache
  • CPU multi-threading and GPU stream-based concurrency (maybe Hyper-Q).
  • (More in-depth reading needed on this topic) What makes a good GPU-based spatial partitioning data structure (like BVH): well, I am no expert in this, and I should read more papers on the topic.
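
To make the "SoA in WFPT" item above concrete, here is a small host-side sketch comparing the two layouts. The struct names `RayAoS` and `RaysSoA` and the helper functions are hypothetical, not taken from this repo. The idea is that when neighboring threads each read the same field of their own ray, the byte stride between those reads decides how many memory transactions a warp needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: all six fields of one ray sit next to each other.
struct RayAoS {
    float ox, oy, oz;  // origin
    float dx, dy, dz;  // direction
};

// Structure of Arrays: one contiguous array per field.
struct RaysSoA {
    std::vector<float> ox, oy, oz, dx, dy, dz;
    explicit RaysSoA(std::size_t n) : ox(n), oy(n), oz(n), dx(n), dy(n), dz(n) {}
};

// Byte distance between the `ox` field of ray 0 and ray 1 in the AoS layout.
// This is the stride between the addresses touched by two neighboring threads.
std::size_t aos_stride_bytes(const std::vector<RayAoS>& rays) {
    assert(rays.size() >= 2);
    const char* first = reinterpret_cast<const char*>(&rays[0].ox);
    const char* second = reinterpret_cast<const char*>(&rays[1].ox);
    return static_cast<std::size_t>(second - first);
}

// Same measurement for the SoA layout: neighboring `ox` values are adjacent.
std::size_t soa_stride_bytes(const RaysSoA& rays) {
    assert(rays.ox.size() >= 2);
    const char* first = reinterpret_cast<const char*>(&rays.ox[0]);
    const char* second = reinterpret_cast<const char*>(&rays.ox[1]);
    return static_cast<std::size_t>(second - first);
}
```

With these definitions the AoS stride equals sizeof(RayAoS) (six floats), while the SoA stride is a single float, so the 32 reads of a warp cover one contiguous run of 32 floats instead of being spread over 32 separate structs. How much this matters in practice depends on the kernel and should be confirmed by profiling.
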

Visualizer Notes

  • imgui has no CMakeLists.txt so we should write it ourselves.
  • I think it is painful to use GLEW on Windows: we have to build GLEW manually, and after compilation, glew32.dll should be manually copied to Windows/System32.
