viraptor
4 months ago
It's 100% decompiled to C, but not fully labelled yet. That means there are lots of auto-generated names all over the place. It would be interesting to see someone try to port it now, though.
nomilk
4 months ago
Would LLMs be good at labelling, or would the risk of false positives just waste more time than it saved?
viraptor
4 months ago
I wish someone would run a proper study. In my experience it helps mark some patterns you may not be immediately familiar with, like CRC functions/tables. It also does a good job where no thinking is required, like when you have partial information: "for(unk=0; unk<unk2; unk++) { unk3=players[unk]... }" - you know what the names are, you just need to do the boring part. For completely unknown things, it may get more interesting. But I know I'd like to at least see the suggestions. It's long, boring work to decompile things fully.
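To make the "boring part" concrete, here's a minimal before/after sketch of that renaming pass - the function, the Player struct, and all the names are hypothetical, not from the actual decompilation:

    typedef struct Player { int health; } Player;
    extern Player players[8];

    /* Raw decompiler output: the pattern is obvious, the names are not. */
    int FUN_8001a2c4(int unk2) {
        int unk, unk3 = 0;
        for (unk = 0; unk < unk2; unk++) {
            unk3 += players[unk].health;
        }
        return unk3;
    }

    /* The same function after the boring-but-obvious renames: */
    int total_player_health(int player_count) {
        int i, total = 0;
        for (i = 0; i < player_count; i++) {
            total += players[i].health;
        }
        return total;
    }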
thethimble
4 months ago
Seems like it would be pretty straightforward to fine-tune an LLM on code + asm pairs to help facilitate reverse engineering.
sim7c00
4 months ago
Better add IR too, and all the optimised variants of the ASM for the given code, etc. It's not as straightforward, but that also depends on the platform - CISC is generally more wacky than RISC, I suppose.
Also, a lot of what's in things like ROMs is I/O to components in the device, so you can disassemble and decompile all you want, but without the right specifications and context you cannot say what the code does.
So it would also need all the specifications of the hardware platform you are running the code on, and in this case perhaps even the hardware in the cartridge (I've heard those also sometimes have their own chips, etc.).
I'd say for 'regular application code' that runs within an OS it might be easier, but you'd still need to provide a lot of context from the actual execution environment to reason properly about what the code actually does. (What does INT 80 run and possibly return, anyway? That code is outside of your target binary.)
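To illustrate that point: a decompiled MMIO routine is just writes to magic addresses until you have the datasheet. The sketch below uses what I believe are the N64's PI (Peripheral Interface) DMA registers, but treat the addresses and names as illustrative rather than actual game code:

    /* What the decompiler shows you: */
    void FUN_80002f10(unsigned int unk, unsigned int unk2, unsigned int unk3) {
        *(volatile unsigned int *)0xA4600000 = unk;
        *(volatile unsigned int *)0xA4600004 = unk2;
        *(volatile unsigned int *)0xA460000C = unk3;
    }

    /* What it means once you have the hardware spec: start a DMA
       transfer from cartridge ROM into RDRAM. */
    void pi_dma_from_cart(unsigned int dram_addr, unsigned int cart_addr,
                          unsigned int len) {
        *(volatile unsigned int *)0xA4600000 = dram_addr; /* PI_DRAM_ADDR */
        *(volatile unsigned int *)0xA4600004 = cart_addr; /* PI_CART_ADDR */
        *(volatile unsigned int *)0xA460000C = len - 1;   /* PI_WR_LEN: writing here starts the DMA */
    }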
tralarpa
4 months ago
> I wish someone ran a proper study
There are several scientific publications on this. But I don't think the latest models are available as convenient plugins for IDA or Ghidra. Guessing variable and function names is considered relatively easy nowadays. Types and structures are the challenge now.
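A hypothetical sketch of why types are the harder half - the decompiler only sees offsets, and someone has to decide they form a struct and what the fields mean:

    /* Before: every access is pointer arithmetic on an unknown blob. */
    int FUN_80031b80(void *param_1) {
        return *(int *)((char *)param_1 + 0x10) - *(int *)((char *)param_1 + 0x14);
    }

    /* After: offsets 0x10 and 0x14 turn out to be two fields of one struct. */
    typedef struct Actor {
        char pad[0x10];
        int hp;      /* +0x10 */
        int damage;  /* +0x14 */
    } Actor;

    int actor_remaining_hp(Actor *a) {
        return a->hp - a->damage;
    }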
HPsquared
4 months ago
Even comments would be useful ("this function might be doing x, or maybe y")
pizzalife
4 months ago
I use the following IDA pro MCP plugin for this: https://github.com/mrexodia/ida-pro-mcp
tralarpa
4 months ago
I have never used it, but I think the GhidrAssist plugin does that (and more).
mavamaarten
4 months ago
In my limited experience (my use case was a decompiled minified jar that I just wanted to peek around in), LLMs are absolutely fantastic at it.
As with any LLM output, of course it won't be 100% perfect, and you shouldn't treat the output as truthful "data". But I could absolutely use it to make sense of things that at first sight were gibberish, with the original next to it.
madarcho
4 months ago
The task sounds similar to descriptions in the API space. People figured LLMs would be awesome at annotating API specs with descriptions that are so often missing. The truth is, everyone is realising it's a bit the opposite: the LLMs are "holding it wrong", making a best guess at what the interfaces do without any deeper analysis. So instead, you want humans writing good descriptions specifically so the LLM can make good choices about how to piece things together.
It's possible you could set it off on the labelling task, but anecdotally, in my experience it will fail when you need to look a couple of levels deep into the code to see how functions play with each other. And again, imo, the big risk is getting a label that _looks_ right but is actually misleadingly wrong.
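A toy example of that failure mode (everything here is hypothetical): the suggested label reads fine at a glance but misleads anyone who doesn't read the body:

    /* LLM-suggested name: "clamp_health". Plausible at a glance... */
    int clamp_health(int hp) {
        if (hp > 100) {
            return 100;
        }
        return hp; /* ...but negative values pass straight through. It's only an
                      upper clamp, so a caller trusting the name to guarantee
                      0 <= hp <= 100 is now subtly broken. */
    }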
Cthulhu_
4 months ago
With regards to API specs, if you let an LLM take a swing at it, is it adding value, or is it a box-ticking exercise because some tool or organization wants you to document everything in a certain way?
If documentation is easy to generate, and/or autogenerated, people are also less likely to actually read it. Worse, if that comment is then fed to another LLM to generate code, it could compound the error.
I think that at this stage, all of the programming best practices will find a new justification in LLMs - that is, a well-documented API will give better results when an LLM takes a swing at it than a poorly documented one. Same with code and programming languages: use straightforward, non-magic code for better results. This was always true, of course, but for some reason people have pushed it into the background or treat it as a box-ticking exercise.
theptip
4 months ago
My take on AI for docs is: it's good, but you need a human review.
It's a lot easier to have someone who knows the code well review a paragraph of text than to ask them to write that paragraph.
Good comments make the code much easier for LLMs to use, as well - especially in cases where LLM-generated docs would subtly misunderstand the purpose.
HPsquared
4 months ago
LLM-assisted reverse engineering is definitely a hard problem but very worthwhile if someone can crack it. I hope at least some "prompt engineers" are trying to make progress.
0x0000ff
4 months ago
Given that the source code for the build engine, which was ported to the N64 and used to make the game, is freely available for non-commercial use, could it be used to map some of the function and variable names?
mikkupikku
4 months ago
> build engine, which was ported to the N64 and used to make the game
I don't think that's what they did. Looking at some gameplay footage on YouTube, it's a third-person game with a full 3D player model, not flat sprites, and the level geometry seems to be proper full 3D without the build engine's distortions when looking up and down. I think they built or used a different engine designed to take advantage of the N64's graphics hardware.
aruametello
4 months ago
> (...) I think they built or used a different engine designed to take advantage of the N64s graphics hardware.
Something around that: they used the build engine as a starting point but "hacked it to oblivion". At the very least it reuses the same level editor (maps are vanilla build editor compatible), and it does keep many of the old bugs, like "killer doors".
semi random source: https://forums.duke4.net/topic/9513-release-zero-hour-resour...
Incidentally, the predecessor "Duke Nukem 64" is already more akin to what GZDoom is, using polygons instead of 2.5D rendering for walls and floors, and they decided to push for polygons in the actors too for the "Zero Hour" release.
sim7c00
4 months ago
This might have some limited uses, but you'd need to know how it was optimised, and perhaps also build the build engine for the same target and then decompile it, to see what it looks like after that kind of treatment.
Perhaps with a bit of luck you'd get some useful markers/functions mapped, though; it's not unheard of.
The problem in my mind (didn't test it, of course) is that this decompilation comes from a different ISA than the one build usually compiles to, so the decompiled version would look totally different. (You don't have the ported sources, I suppose, only the originals.)
userbinator
4 months ago
With things like Ghidra now freely available, "100% decompiled to C" really isn't that high of a bar anymore.
dlcarrier
4 months ago
Usually it means it has bit-perfect recompilation, which can take a lot of work.
foldor
4 months ago
This is very wrong. Ghidra might decompile to some C code, but this is a completely different, and much higher, bar. This is 100% matching C code that compiles to the exact binary the game shipped with. That means it can be modified, tweaked and upgraded. Ghidra is helpful for doing some research, but its output won't compile.
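"100% matching" is verified about as bluntly as it sounds: rebuild the ROM and compare it byte-for-byte against the original dump. A minimal sketch of that check (file names are hypothetical):

    #include <stdio.h>

    /* Returns 1 if the two files are byte-identical, 0 otherwise. */
    static int files_match(const char *a_path, const char *b_path) {
        FILE *a = fopen(a_path, "rb");
        FILE *b = fopen(b_path, "rb");
        int ca = EOF, cb = EOF, match = 0;
        if (a && b) {
            do {
                ca = fgetc(a);
                cb = fgetc(b);
            } while (ca == cb && ca != EOF);
            match = (ca == cb); /* both streams ended together */
        }
        if (a) fclose(a);
        if (b) fclose(b);
        return match;
    }

    int main(void) {
        /* baserom.z64: original dump; build/rom.z64: recompiled output. */
        puts(files_match("baserom.z64", "build/rom.z64") ? "OK" : "MISMATCH");
        return 0;
    }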
userbinator
4 months ago
That's not what the parent comment said: "100% decompiled to C, but not fully labelled yet".
Getting the exact same binary is definitely something else. Figuring out the exact compiler version and options, libraries, etc. is itself nontrivial.
chii
4 months ago
I think this sort of scenario is where an AI comb-through of the code is going to add value - get an AI to guess the likely names of variables/functions based on usage, context, etc. It'd make a great starting point for manually labelling each variable correctly.
skerit
4 months ago
I'm using Ghidra with Claude-Code to reverse engineer an old transportation-sim game, and Claude is very good at figuring out what a function does and what it and its variables should be named.
jonhohle
4 months ago
I'd argue that using AI to provide that input means it's no longer a creative work, and puts the output in the realm of not being transformative. Those are two of the main defenses these projects have against copyright violation and all of the legal risk that comes with decompiling.
chii
4 months ago
It's currently a legal grey area what AI output constitutes.
I'd argue that copyright should not be affected by the tools being used. If the project would have been compliant with copyright (a la fair use) had it used human hands, then using an AI tool makes zero difference.
jonhohle
4 months ago
USCO disagrees. Output of an LLM is not considered a new, original work. https://www.insidetechlaw.com/blog/2024/05/generative-ai-how...