-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debugging Virtex DSP48E1 bug #445
Comments
First, Vishal found most of the bugs it seems -- all that remains are bugs in the combinational case. That is, if we synthesize combinational designs, the result from Lakeroad does not function as expected. Interestingly, the differences can be subtle, too. For example, it took a number of iterations for random testing with Verilator to find a case that breaks when testing a 9bit mul-sub case. |
2024-06-26Trying to binary search the bug. Process:
Returning just the input A works. Returning multiplication of A * B works.
I thought I found the bug, but still getting bugs in eval. At least one of the tests that's failing in the eval is passing when I make it an integration test in LR. Unsure why that would be, but it certainly feels like it could be a mismatched Lakeroad version or something. Furthermore, the tests are passing when I run them on leviathan, within the lakeroad submodule of the lakeroad eval repo. weird. Yeah, every test i'm adding to Lakeroad that's failing in the eval seems to be passing in Lakeroad itself. Something's weird. List of tests not passing:
Gonna run an eval run in Docker and see if it replicates there. |
2024-06-27Seems like it's still running into errors in Docker. So I think there's two options at this point
The second one is easier to check, so i'll do it first, but at this point it really does feel more likely that it's the first. Oh interesting. Before, I tried disabling all solvers but Bitwuzla, so that I could get the eval to match what was happening in the integratio ntests (because all of my integration tests were using Bitwuzla). But when I ran the eval with just Bitwuzla...well, what happened? I forget, but I think what happened was that I couldn't replicate the issues? Oh, no, according to this thread with Vishal, it was still failing when I just used Bitwuzla. Anyway, what just happened was that I made an integration test that uses STP instead of Bitwuzla, and it fails. So I finally have the case I was looking for, which is good. Now I can go back to the delta debugging I was doing earlier. I just went through the diffs. Here's the thoughts i had.
Easy check of the second point -- just remove the delay in the Verilator model and see what happens. interestingly, it's still an off by one error. I'm almost certain the bug is in the carry logic. There's some weird stuff happing there. Specifically I think I just realized that qcarryin_o_mux0 isn't driven in the modified-for-racket version, which is definitely not right (it's driven in the original version). I might have deleted that code---wouldn't be surprised.
Ok, there was definitely a bug with qcarryin_o_mux0 not being set - just a typo bug. That's resolved. Still bugs left though. Running the eval and things are still failing. E.g. Again, still an off by one error. So I suspect it's more bugs in the carry. There are still latches being inferred, too.
Killed the latches. I should really include somewhere that yosys is a very helpful debugging tool when doing this process. Still running into bug. Interestingly, the multiple drivers warning also disappeared? That doesn't feel right. I don't feel like I fixed that issue. |
2024-06-28Realizing I did solve the multiple drivers problem when I fixed the typo bug. Got another bug to replicate in integration tests: Interestingly, fails with STP but not Bitwuzla. So theyre just finding different solutions, I assume. |
2024-06-29My debugging strategy now is to put $displays in the Verilator model to see where in the pipeline the value diverges from what i expect. Frankly I probably should be using a waveform viewer for this. Unsurprisingly, qcarryin_o_mux is getting set to 1 at some point. That's definitely the problem. Ok, I think the problem is coming from the fact taht we changed
to
If I recall, we did this because of a combinational cycle that was caused by using qp_o_mux. We thought through it though, and i think this change should have been legal based on some constraints that should be in the design. Found the bug -- the constraints in the arch description for carryinsel were wrong. They'd been copy-pasted and not updated. |
2024-07-01Apparently not done with debugging yet. One of these tests is failing in CI. At this point, might be most efficient to just look through the constraints for any more obvious bugs. No obvious bugs. Back on the hunt for bugs in the code itself. Specifically just adding $display statements into the simulation model. For this singular example, it's correct up to qmult_o_mux. So bug is presumably still after that. I wonder if it's still carryin related. z mux is C, x is the result of the mul, y is zero. so those all check out just fine. It seems to be an issue with the P register. It seems like the racket import version thinks that the ALU output will make it to the output, but it doesn't. The correct output is presumably at alu_o, but doesn't make it through to PREG. I wonder if vishal just didn't find all of the constraints on PREG? They're encoded in a very, very annoying way in the simulation file. I wonder if they're more compact in the pdf. He got all of the ones I found when I just did a pass. |
2024-07-02IT seems like the PREG is working fine---I think there might be a register in the middle somewhere that's enabled when it shouldn't be. C is making it through to the ALU instantly. But the result of the multiply does not -- it seems to be only making it to the ALU Wait nevermind -- it seems like an answer is coming out of the multiplier before the clock ticks, but it's not correct. then when the clock ticks, the multiplier (qmult_o_mux) signal gets the right answer (minus the post-add). But then because omg, it was because i had different --pipeline-depths set for synthesis and simulation. This would have been fixed by #442. Wasted a whole day on this... |
Tracking issue for finding the bug in the DSP48E1 model.
Some things to try in debugging:
Other subtasks to finish this up:
The text was updated successfully, but these errors were encountered: