-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Debugging in FDB
Here are some helpful debugging tips when you encounter with simulation failures.
-
Why I cannot reproduce failures reported by Joshua on my local machine?
- Simulation tests are only reproducible on the same OS type, and with the same build environment. A binary from a CMake build in docker may not execute the same as a CMake build in the centos7 VM. Libstdc++ vs libc++ makes determinism change.
- Make sure your test is deterministic by checking the unseed numbers are same across multiple runnings. (The seed initializes the random number generator for the test, and then the unseed is the last random number generated.)
- Verify the
trace*.xml
is well-formatted. If you know your test is successful on your local machine, check the lasttrace*.xml
file using command likemono ./correctness/bin/TestHarness.exe summarize trace.0.0.0.0.0.1595012319.MJbTZt.0.1.xml summary.xml "" "" false
. See whether summary.xml has something the same as your Joshua report. (Maybe open the XML file in the browser directly also help.)
-
How to use
valgrind
to help diagnose memory problems?- Edit
CMakeCache.txt
. SetUSE_VALGRIND=ON
then recompile fdbserver. Thenvalgrind --log-file=valg.log ./bin/fdbserver -r simulation -s 366751840 -f ../tests/fast/SwizzledRollbackSideband.toml
can work. - If illegal instruction error appears when you run
valgrind
(saysvex amd64->IR: unhandled instruction bytes...
), setUSE_AVX=OFF
. - You can also submit
packages/valgrind-7.0.0.tar.gz
to Joshua. It takes more time to finish one assemble.
- Edit
-
How to choose a subset of tests to run on Joshua?
- Edit
TEST_INCLUDE:STRING, TEST_EXCLUDE:STRING
. Notice that the regex doesn't have.toml
suffix, and the path can be related to${FDB_REPO_ROOT}/tests
(ex.fast/.*
), or not (ex.Swizzled.*
means all test files under subfolders started with Swizzled). - Only test spec files under
${FDB_REPO_ROOT}/tests/{rare,restarting,slow,fast}
will be choosen. The files under ./tests/ are some tests that used to work or be useful but not anymore.
- Edit
-
How to get useful information in
trace*.xml
, considering there're too many lines?- Start grep Trace events from
fdbserver/tester.actor.cpp
and locate you failed in which test phase. - Function
waitForQuietDatabase
inQuietDatabase.actor.app
also contains many useful trace message in simulation test.
- Start grep Trace events from
-
Get you to know which line causes the problem.
- use coredump files. (ex.
gdb ./bin/fdbserver core.17426
) - use addr2line. The problematic trace stack is logged in trace files and stdout. (ex.
addr2line -e ./bin/fdbserver -p -C -f -i 0x7fae93006630 0x26b5f98 0x26eff9b 0x26dc132 0x26df565 0x26dfaea 0x25c8798 0x25c4f45 0x25c20d8 0x25c1e0e 0x25c0118 0x25b816c 0x25bf888 0x25bf4ba 0x25bc968 0x25bc27a 0x1eb6148 0xf34538 0xf34316 0x287a431 0x287a1f1 0xfc6c78 0x28edc37 0x28eda96 0x28ed731 0x28edae2 0x28ee16c 0xfc6c78 0x29d1016 0x29c5bfe 0x28df844 0x13f3c18 0x7fae92745555
- use coredump files. (ex.
-
A restart test failed. How do I reproduce it locally?
-
Try to set
TestConfig.simpleConfig=true
to make the fdb cluster as a small cluster to make debug easier. -
MutationTracking.h
andMutationTracking.cpp
file contains useful functions to track all mutation activities of up to 2 keys in simulation test. Try to set the key you want to track, andgrep MutationTracking
. -
Use
fdbserver --print-sim-time
to print out the simulation seconds if you experience a longtime_out
failure. Kill the simulation after you think the debug already happened after a certain simulation interval. -
Use
DEBUG_ERROR
macro to enable backtrace when a specified error is created. -
Use
tr.debugTransaction()
together withdebugRandom()
to enable the trace for a specified transaction without change the deterministic randomness of a simulation failure. -
There are several compilation options can enable sanitizer utilities provided by
clang
. For example,USE_ASAN
,USE_TSAN
...