Reading Assembly is challenging. Students in Professor Malte Schwarzkopf’s CSCI 0300 computer systems course at Brown University, for example, are required to explain the functionality of an Assembly program during exams. A compiler produces Assembly as an intermediate step when it turns a C program into an executable file. Going in the opposite direction, from Assembly instructions back to human-readable C code, is much harder. This repository represents our attempt to train a neural Assembly-to-C decompiler based on the transformer architecture.
To read more about our methodology and results, take a look at our final paper or the slides from our final presentation. To try out the decompiler yourself, our demo website can be found at neuraldecompiler.pythonanywhere.com.
The following online resources were utilized in the data generation process.
- LeetCode: https://leetcode.com/
- Compiler Explorer API: https://github.com/compiler-explorer
- ChatGPT: https://chat.openai.com/
- TensorFlow Tutorials: https://www.tensorflow.org/text/tutorials/transformer
This project was inspired by the following papers.
- Baptiste Roziere. Unsupervised Translation of Programming Languages. Neural and Evolutionary Computing [cs.NE]. Université Paris sciences et lettres, 2022. English. ⟨NNT: 2022UPSLD015⟩. ⟨tel-03852612⟩
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, 6000–6010.
Model:
- NumPy
- TensorFlow >= 2.8.0
- NLTK
Demo Site:
- Flask
- Flask-Cors
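The following is a minimal sketch, not part of the repository, for checking whether a local Python environment satisfies the dependencies listed above. It assumes the PyPI distribution names `numpy`, `tensorflow`, `nltk`, `Flask`, and `Flask-Cors`; the helper names `parse_version` and `check_requirements` are hypothetical.

```python
from importlib import metadata

# Distribution names are assumed to match PyPI.
# The only minimum version stated in the list above is TensorFlow >= 2.8.0.
REQUIREMENTS = {
    "numpy": None,
    "tensorflow": (2, 8, 0),
    "nltk": None,
    "Flask": None,
    "Flask-Cors": None,
}


def parse_version(text):
    """Turn '2.8.0' or '2.15.0.post1' into a tuple of at least three integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '2.8' compares equal to '2.8.0'
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Return {name: status}, where status is 'ok', 'missing', or 'too old'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is not None and parse_version(installed) < minimum:
            report[name] = "too old"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script prints one line per dependency. The version parsing is deliberately simple and ignores pre-release ordering, so treat the result as a quick sanity check rather than a full compatibility test.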
NeuralDecompiler was developed by Andrew Yang, Tiger Ji, Frank Chiu, and Justin Cheng to fulfill the final project requirement of CSCI 1470/2470 - Deep Learning, which was taught by Professor Ritambhara Singh at Brown University.
Andrew Yang contributed to the preprocessing and tokenization of training data. Andrew also implemented and trained the transformer model.
Tiger Ji contributed to the compilation of C to Assembly as well as the cleaning of C and Assembly files. Tiger also wrote the script to rename function, struct, and variable names in the C files. Finally, Tiger wrote the client-facing code.
Frank Chiu contributed to the scraping of user-submitted C solutions from LeetCode. Frank also contributed to the renaming process of function, struct, and variable names in the C files.
Justin Cheng contributed to the ChatGPT script. Justin also aided Andrew, Tiger, and Frank in the overall development, debugging, and training process.