Foundational LLMs for Decompiled Code
Advisors: Professor Lin Tan, PhD students Nan Jiang and Danning Xie.
In this project, we aim to develop a foundational model for refining decompiler output. The previous state-of-the-art approach, LLM4Decompile, is trained only on output from the Ghidra SRE tool, and we hope to build on that model's generalization capabilities so that it can also handle output from the Hex-Rays decompiler, accessed through IDA Pro. I am therefore exploring efficient self-supervised methods and architectural adjustments for finetuning the released LLM4Decompile model. Gathering and decompiling code from GitHub repositories has been a major bottleneck in this project, as has the slow training of very large models; both underscore, for me, the importance of model- and data-efficient approaches.
