Appendix A: Debugging with Ray

Moving to distributed systems may require you to add a new set of debugging techniques to your existing ones. Thankfully, tools like Pdb and PyCharm allow you to connect remote debuggers, and Ray’s local mode lets you use your existing debugging tools in many other situations. Some errors happen outside Python, making them more difficult to debug, such as container out-of-memory (OOM) errors, segmentation faults, and other native errors.

Note

Some components of this appendix are shared with the Scaling Python with Dask book, as they are generally good advice for debugging all types of distributed systems.

General Debugging Tips with Ray

You likely have your own standard debugging techniques for working with Python code, and this is not meant to replace them. Some general techniques that make more sense with Ray include:

  • Break up failing functions into smaller ones; since ray.remote schedules at the granularity of a function, smaller functions make it easier to isolate the problem.

  • Be careful about any unintended scope capture.

  • Sample data and try to reproduce locally (local debugging is often easier).

  • Use mypy for type checking; while we haven’t included types in all of our examples, liberal type usage in production code can catch tricky errors.

  • When the issue appears regardless of parallelization, debug your code in single-threaded mode, where it can be easier to understand what’s going on.

With those general tips in mind, it’s time to learn more about the tools and techniques that can help with your Ray debugging.

Serialization Errors

Serialization plays an important part in Ray, but it can also be a source of headaches, as small changes can result in unintended variable capture and serialization failure. Thankfully, Ray has a utility function, inspect_serializability in ray.util, that you can use to debug serialization errors. If you intentionally define a function that captures non-serializable data, like Bad Serialization Example, you can run inspect_serializability and see how it reports the failure (as in Bad Serialization Result).

Example 1. Bad Serialization Example
link:examples/ray_examples/debugging/Debugging.py[role=include]
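
A minimal sketch of such a function, reconstructed from the output below rather than taken verbatim from the example file, might look like:

import multiprocessing

from ray.util import inspect_serializability

# Hypothetical reconstruction of the example: `pool` lives in the global
# scope, and pool objects cannot be pickled, so any function that captures
# it will fail to serialize.
pool = multiprocessing.Pool(5)

def special_business(x):
    return pool.map(str, range(x))

inspect_serializability(special_business, name="special_business")
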
Example 2. Bad Serialization Result
=========================================================================
Checking Serializability of <function special_business at 0x7f78802820d0>
=========================================================================
!!! FAIL serialization: pool objects cannot be passed between processes or pickled
Detected 1 global variables. Checking serializability...
    Serializing 'pool' <multiprocessing.pool.Pool state=RUN pool_size=5>...
    !!! FAIL serialization: pool objects cannot be passed between processes or pickled
        Serializing 'Process' <function Pool.Process at 0x7f785905d820>...
        Serializing '_get_tasks' <function Pool._get_tasks at 0x7f7859059700>...
        Serializing '_get_worker_sentinels' <function Pool._get_worker_sentinels at 0x7f785905daf0>...
        Serializing '_handle_results' <function Pool._handle_results at 0x7f7859059670>...
        Serializing '_handle_tasks' <function Pool._handle_tasks at 0x7f78590595e0>...
        Serializing '_help_stuff_finish' <function Pool._help_stuff_finish at 0x7f78590599d0>...
        Serializing '_join_exited_workers' <function Pool._join_exited_workers at 0x7f785905db80>...
        Serializing '_maintain_pool' <function Pool._maintain_pool at 0x7f785905dd30>...
        Serializing '_repopulate_pool_static' <function Pool._repopulate_pool_static at 0x7f785905dca0>...
        Serializing '_wait_for_updates' <function Pool._wait_for_updates at 0x7f78590594c0>...
        Serializing 'Process' <function Pool.Process at 0x7f785905d820>...
        Serializing '_cache' {}...
        !!! FAIL serialization: SimpleQueue objects should only be shared between processes through inheritance
            Serializing 'notifier' <multiprocessing.queues.SimpleQueue object at 0x7f784e54a2e0>...
            !!! FAIL serialization: SimpleQueue objects should only be shared between processes through inheritance
=========================================================================
Variable:

    FailTuple(notifier [obj=<multiprocessing.queues.SimpleQueue object at 0x7f784e54a2e0>, parent={}])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable.
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
=========================================================================
(False,
 {FailTuple(notifier [obj=<multiprocessing.queues.SimpleQueue object at 0x7f784e54a2e0>, parent={}])})

In the above example, Ray checks the elements for serializability and also calls out that the non-serializable value pool is coming in from the global scope.

Local Debugging with Ray Local

Using Ray in local mode allows you to use the tools you are used to without having to deal with the complexity of setting up remote debugging. We won’t cover the variety of local Python debugging tools, so this section just exists to remind you to try to reproduce the problem in local mode first before you start using the fancy debugging techniques covered in the rest of this appendix.
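
As a reminder of what this looks like, here is a minimal sketch using the local_mode flag, which exists in the Ray versions this book targets but has been deprecated in newer releases:

import ray

# Local mode runs tasks in the driver process, so standard tools
# (pdb, print, an IDE debugger) behave as they would in any Python program.
ray.init(local_mode=True)

@ray.remote
def suspect_fn(x):
    return x + 1

print(ray.get(suspect_fn.remote(1)))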

Remote Debugging

Remote debugging can be an excellent tool but requires more access to the cluster, something that may not always be available. Ray’s integrated ray debug tool supports tracing across the entire cluster. Unfortunately, other remote Python debuggers only attach to one machine at a time, so you can’t simply point your debugger at an entire cluster.

Warning

Remote debugging can have a significant performance impact as well as security implications. It is important to notify all users before enabling remote debugging on a cluster.

If you control your own environment, setting up remote debugging is comparatively straightforward, but in an enterprise deployment, you may find resistance to enabling this. In those situations, using a local cluster or asking for a development cluster to debug on are your best options.

Tip

For interactive debuggers, you may need to work with your systems administrator to expose additional ports from your cluster.

Ray’s Integrated Debugger (via Pdb)

Ray has integrated support for debugging with Pdb, allowing you to trace code across your cluster. You still need to change the launch command from ray start to ray start --ray-debugger-external to load the debugger. With Ray’s external debugger enabled on the workers, Pdb will listen on an additional port (without any authentication) for debuggers to connect.

Once your cluster is configured and launched, you can start the Ray debugger on the head node[1]. To start the debugger, you just need to run ray debug, and then you can use all of your favorite pdb debugging commands.
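
A minimal sketch of the workflow, assuming the breakpoint()-based flow described in Ray’s debugger documentation, might look like this; once the task hits the breakpoint, running ray debug on the head node lists it and lets you attach:

import ray

ray.init()

@ray.remote
def troublesome_fn(x):
    # With the Ray debugger enabled, hitting breakpoint() inside a remote
    # task pauses it; `ray debug` on the head node then drops you into a
    # pdb-style session attached to this task.
    breakpoint()
    return x * 2

print(ray.get(troublesome_fn.remote(21)))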

Other Tools

For non-integrated tools, since each call to a remote function can be scheduled on a different worker, you may find it easier to (temporarily) convert your stateless function into an actor. This has real performance implications, so it may not be suitable for a production environment, but it does mean that repeated calls will be routed to the same machine, making debugging simpler.
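
A rough sketch of this conversion, using a hypothetical process_record function, might look like:

import ray

ray.init()

# A hypothetical stateless function: each call may run on a different worker.
@ray.remote
def process_record(record):
    return len(record)

# A temporary actor version for debugging: every call is handled by the one
# worker process hosting the actor, so a single attached debugger sees them all.
@ray.remote
class RecordProcessor:
    def process(self, record):
        return len(record)

processor = RecordProcessor.remote()
print(ray.get(processor.process.remote("some record")))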

PyCharm

PyCharm is a popular Python IDE with an integrated debugger. While Ray does not integrate with it the way it does with Pdb, you can still make it work with a few simple changes. The first step is to add the pydevd-pycharm package to your container/requirements. Then, in the actor you want to debug, you can enable PyCharm debugging as shown in Enabled PyCharm remote debugging.

Example 3. Enabled PyCharm remote debugging
link:examples/ray_examples/debugging/Debugging.py[role=include]
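
A rough sketch of the pattern, with a hypothetical actor and a placeholder host and port for your PyCharm debug server, might look like:

import ray

ray.init()

@ray.remote
class IsolatedWork:
    def __init__(self):
        # Placeholder host and port: point these at wherever your PyCharm
        # debug server is listening and reachable from the worker.
        import pydevd_pycharm
        pydevd_pycharm.settrace(
            "my-dev-machine", port=7779,
            stdoutToServer=True, stderrToServer=True)

    def do_work(self, x):
        return x + 1

worker = IsolatedWork.remote()
print(ray.get(worker.do_work.remote(1)))
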

Your actor will then create a connection back from the executor to your PyCharm IDE.

Python Profilers

Python profilers can help you track down memory leaks, hot code paths, and other conditions that are important to address but are not error states.

Profilers are less problematic than live remote debugging from a security point of view as they do not require a direct connection from your machine to the cluster. Instead, the profiler runs, generates a report, and then you can look at this report offline. Profiling still introduces performance overhead, so be careful when deciding if you wish to enable it.

To enable Python memory profiling on the executors, you can change the launch command to have the prefix mprof run -E --include-children -o memory_profile.dat --python. You can then collect the memory_profile.dat files and plot them with matplotlib on your machine to see if anything sticks out.
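
One possible sketch of plotting a collected file with matplotlib, assuming the usual "MEM <MiB> <timestamp>" line format of mprof output, is:

import matplotlib.pyplot as plt

# Parse an mprof output file by hand; lines other than "MEM ..." (such as
# CMDLINE) are skipped.
timestamps, mem_mib = [], []
with open("memory_profile.dat") as f:
    for line in f:
        parts = line.split()
        if parts and parts[0] == "MEM":
            mem_mib.append(float(parts[1]))
            timestamps.append(float(parts[2]))

start = timestamps[0]
plt.plot([t - start for t in timestamps], mem_mib)
plt.xlabel("seconds since start")
plt.ylabel("memory (MiB)")
plt.savefig("memory_profile.png")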

Similarly, you can enable function profiling in your Ray executors by replacing ray start in your launch command with echo "from ray.scripts.scripts import main; main()" > launch.py; python -m cProfile -o stats launch.py. This is a bit more complicated than using mprof since the default Ray launch script does not play nicely with cProfile, so you need to create a different entry point, but it is conceptually equivalent.
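
Once you have copied the resulting stats file back to your machine, a small sketch of inspecting it with the standard library’s pstats module is:

import pstats

# Load the cProfile output (assumed here to be copied locally as "stats")
# and show the 25 most expensive call sites by cumulative time.
profile = pstats.Stats("stats")
profile.sort_stats("cumulative").print_stats(25)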

Warning

The line_profiler package used for annotation-based profiling does not work well with Ray, so you must use whole program profiling.

Ray and Container Exit Codes

Exit codes are numeric codes that are set when a program exits, with any value besides 0 normally indicating failure. These codes (by convention) generally have meaning but are not 100% consistent. Some common exit codes are:

Exit code   Common meaning
0           Success (but often misreported, especially in shell scripts)
1           Generic error
127         Command not found (in a shell script)
130         User terminated (Ctrl-C or kill)
137         Out-of-memory error OR kill -9 (force kill, non-ignorable)
139         Segmentation fault (often a null pointer dereference in native code)

You can print out the exit code of the last command run with "echo $?", or in a script running in strict mode (like some Ray launch scripts), you can print this out while still propagating the error with "[raycommand] || (error=$?; echo $error; exit $error)".[2]

Ray Logs

Ray’s logs behave differently than those of many other distributed applications. Since Ray tends to launch worker processes in the container separately from the initial container startup[3], the stdout and stderr associated with the container will (most often) not contain the debugging information you need. Instead, you can access the worker logs on the head node by looking for the latest session directory, which Ray creates a symlink to at /tmp/ray/session_latest.
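
For example, a small sketch of skimming the tail of each worker log on the head node, assuming the default layout of per-worker files under /tmp/ray/session_latest/logs, might look like:

import glob
import os

# Print the last few lines of each worker log file.
log_dir = "/tmp/ray/session_latest/logs"
for path in sorted(glob.glob(os.path.join(log_dir, "worker-*"))):
    with open(path) as f:
        tail = f.readlines()[-5:]
    print(f"=== {path} ===")
    print("".join(tail), end="")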

Container Errors

Debugging container errors can be especially challenging, as many of the standard debugging techniques explored so far run into obstacles. These errors range from common occurrences, like out-of-memory errors, to more esoteric ones. It can be difficult to determine the cause of a container error or exit, as the container exiting sometimes removes the logs.

On Kubernetes, you can sometimes get the logs of a container that has already exited by adding -p to your log request (e.g., kubectl logs -p). You can also configure terminationMessagePath to point to a file that contains information regarding the termination. If your Ray worker is exiting, it can make sense to customize the Ray container launch script to add some additional logging. Common additions include writing the last few lines of syslog or dmesg output (looking for OOMs) to a file location that you can use to debug later.

One of the most common kinds of container errors, native memory leaks, can be challenging to debug. Tools like valgrind can sometimes track down native memory leaks. The details of using tools like valgrind are beyond the scope of this book, so check out the Python Valgrind documentation. Another "trick" you might want to try is effectively bisecting your code; since native memory leaks happen most frequently in library calls, you can try commenting them out and running tests to see which library call is the source of the leak.

Native errors (seg faults, core dumps, etc.)

Native errors and core dumps can be challenging to debug for the same reasons as container errors. Since these types of errors often result in the container exiting, accessing the debugging information can become challenging. A "quick" solution to this is to add a "sleep" to the ray launch script (on failure) so that you can connect to the container (e.g. [raylaunchcommand] || sleep 100000) and use native debugging tools.

However, accessing the internals of a container can be easier said than done. In many production environments, you may not be able to get remote access (e.g., kubectl exec on Kubernetes) for security reasons. If that is the case, you can (sometimes) add a shutdown script to your container specification that copies the core files to a location that persists after the container shuts down (e.g., S3, HDFS, or NFS).

Conclusion

Getting started with your debugging tools takes a bit more work in Ray, but when possible, Ray’s local mode offers a great alternative to remote debugging. You can take advantage of Ray actors to make remote functions schedule more predictably, making it easier to know where to attach your debugging tools. Not all errors are created equal; some, like segmentation faults in native code, are especially challenging to debug. Good luck finding the bug(s); we believe in you.


1. Ray has the "ray attach" command to create an ssh connection to the head node; however, not all head nodes will have an ssh server. On Ray on Kubernetes, you can get to the head node by running kubectl exec -it -n [rayns] [podname] -- /bin/bash. Each cluster manager is slightly different here, so you may have to check your cluster manager's documentation.
2. The exact details of where to configure this change depend on the cluster manager being used. For Ray on Kubernetes with the autoscaler you can change workerStartRayCommands, for Ray on AWS worker_start_ray_commands, etc.
3. Either by ssh or kubectl exec