Can machine code be transpiled to JVM bytecode?

Time:07-01

I was wondering: the JVM is designed to translate bytecode to machine code, but I can't seem to find any information on whether the inverse operation is even possible. It seems like if it were possible it would happen more often, as it would allow you to run any binary anywhere the JVM runs.

So, obviously, there's something I'm missing here, but I'm not sure what. Of course there are apps that depend on system-dependent frameworks, but for simple apps that depend solely on, say, the C standard library, it seems like it should be possible. Does anybody know why this sort of thing simply isn't attempted?

CodePudding user response:

It is not practical to convert an arbitrary machine code binary into JVM bytecode. For example, compiled C programs and libraries contain instructions that have no equivalent in JVM bytecode. You would have to divine the meaning of the program from its binary and translate that intent into something the JVM can express. In the general case of an arbitrary machine code program, we don't know how to automate this.

What you could do instead, however, is write a machine code simulator in Java that would interpret machine code programs and system calls for all possible instruction sets and operating systems. You could then take any program binary and simulate it using the JVM! We know how to do this: it is a simple matter of programming.

It would be a lot of effort (probably hundreds of thousands of person-hours) and have the obvious downside of performance degradation: we would expect programs to run at a tiny fraction of their native speed. It would also risk numerous bugs and require extensive testing. In short, a big project.
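
To make the simulator idea concrete, here is a minimal sketch of a fetch-decode-execute loop in Java. The four opcodes and the four-register machine are invented for illustration and do not correspond to any real instruction set; a real simulator would also have to model memory, flags, and system calls.

```java
import java.util.Arrays;

public class ToySimulator {
    // Hypothetical opcodes for an invented 4-register machine (not a real ISA).
    public static final int LOAD_IMM = 0; // reg[a] = b
    public static final int ADD = 1;      // reg[a] = reg[a] + reg[b]
    public static final int JNZ = 2;      // if reg[a] != 0, jump to instruction index b
    public static final int HALT = 3;     // stop and return the registers

    // Each instruction is {opcode, a, b}. Returns the final register contents.
    public static int[] run(int[][] program) {
        int[] reg = new int[4];
        int pc = 0; // index of the next instruction to execute
        while (pc < program.length) {
            int[] ins = program[pc];
            switch (ins[0]) {
                case LOAD_IMM: reg[ins[1]] = ins[2]; pc++; break;
                case ADD: reg[ins[1]] += reg[ins[2]]; pc++; break;
                case JNZ: pc = (reg[ins[1]] != 0) ? ins[2] : pc + 1; break;
                case HALT: return reg;
                default: throw new IllegalStateException("unknown opcode " + ins[0]);
            }
        }
        return reg;
    }

    public static void main(String[] args) {
        // Adds 5 + 4 + 3 + 2 + 1 into r0, counting r1 down by adding r2 = -1.
        int[][] program = {
            {LOAD_IMM, 0, 0},
            {LOAD_IMM, 1, 5},
            {LOAD_IMM, 2, -1},
            {ADD, 0, 1},   // r0 += r1
            {ADD, 1, 2},   // r1 += r2, i.e. r1 -= 1
            {JNZ, 1, 3},   // loop back while r1 != 0
            {HALT, 0, 0},
        };
        System.out.println(Arrays.toString(run(program))); // prints [15, 0, -1, 0]
    }
}
```

Every simulated instruction costs several JVM operations (array loads, a switch, a program counter update), which is one reason an interpreter like this runs well below native speed.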

CodePudding user response:

Your question is interesting because people usually ask the opposite, "can a Java (bytecode) program be compiled to a native executable?", thinking Java is too slow. Java (the VM) is not slow these days; it just consumes more memory and power than a carefully optimized native program would, and runs a bit slower.

You also asked about a specific case - simple apps that only depend on the C standard library. I thought about it, and let me explain what problems translating such programs to JVM bytecode may involve.

My first programming language was Java, but I never dealt with the JVM seriously, so some details might be wrong.

You have to implement the standard C library (libC). This is relatively easy if you can depend on the Java standard library, which has much wider coverage than libC. Without the Java dependency, you will have to implement native methods for each platform, because the JVM itself doesn't provide any built-ins for interacting with the user, such as console I/O.
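
As a rough sketch of what "libC on top of the Java standard library" could look like, the class below models the translated program's memory as a `byte[]` and a C pointer as an `int` offset into it. The class name `LibcShim` and this memory model are assumptions made for illustration, not part of any existing project, and only `strlen` and `puts` are shown.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class LibcShim {
    private final byte[] memory;      // hypothetical flat address space of the translated program
    private final PrintStream stdout; // where the program's standard output goes

    public LibcShim(byte[] memory, PrintStream stdout) {
        this.memory = memory;
        this.stdout = stdout;
    }

    // strlen: count bytes from ptr up to (not including) the terminating NUL.
    public int strlen(int ptr) {
        int n = 0;
        while (memory[ptr + n] != 0) {
            n++;
        }
        return n;
    }

    // puts: write the NUL-terminated string at ptr plus a newline.
    // Returns a non-negative value, as C's puts does on success.
    public int puts(int ptr) {
        stdout.write(memory, ptr, strlen(ptr));
        stdout.write('\n');
        stdout.flush();
        return 0;
    }

    public static void main(String[] args) {
        byte[] mem = "hello\0".getBytes(StandardCharsets.US_ASCII);
        new LibcShim(mem, System.out).puts(0); // prints "hello"
    }
}
```

A translated call to `puts` would then become an ordinary method call, with the pointer argument taken from wherever the native ABI placed it.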

Now, you have to detect where the arguments of a libC call, or of any function call, are stored. They will be in registers or in a specific memory area, depending on the ABI. Often it's as simple as r0 = 1; r1 = 2; call f(r0, r1), but sometimes the value of r0 was determined much earlier in the code. You need a mechanism to find out where r0 was last modified before the call.
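
For straight-line code, finding where an argument register was last written can be sketched as a backward scan from the call. The representation below (one array entry per instruction holding the register it writes, or -1) is a simplification invented for this example; real binaries have branches and loops, so an actual tool would need a data-flow analysis over the control-flow graph rather than a linear scan.

```java
public class LastDefinition {
    // writes[i] is the register number written by instruction i, or -1 if it writes none.
    // Returns the index of the last instruction before callIndex that writes reg, or -1.
    public static int lastWriteBefore(int[] writes, int callIndex, int reg) {
        for (int i = callIndex - 1; i >= 0; i--) {
            if (writes[i] == reg) {
                return i;
            }
        }
        return -1; // reg was not written in this block; its value comes from elsewhere
    }

    public static void main(String[] args) {
        String[] listing = {
            "r0 = 1",       // 0
            "r2 = r0 + 4",  // 1
            "r1 = 2",       // 2
            "r3 = r2 * r2", // 3
            "call f",       // 4: f takes its arguments in r0 and r1
        };
        int[] writes = {0, 2, 1, 3, -1};
        int def0 = lastWriteBefore(writes, 4, 0);
        int def1 = lastWriteBefore(writes, 4, 1);
        System.out.println("r0 set by: " + listing[def0]); // prints "r0 set by: r0 = 1"
        System.out.println("r1 set by: " + listing[def1]); // prints "r1 set by: r1 = 2"
    }
}
```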

The program will never be just a series of calls to libC. It will contain all sorts of operations that have to be translated for the JVM. Modern CPUs in commercial use are register machines, but the JVM is a stack machine, or actually a hybrid machine.

The basic arithmetic operations are done on the operand stack, but each method call creates its own frame. Each frame has an array of local variables whose size is determined at compile time, much as a native function has its own stack frame. When you call int add(int a, int b), for example, the method gets a frame whose local variable array contains at least the two arguments. Those variables are then pushed onto the stack with load instructions. An add instruction consumes the two values on top of the stack and replaces them with the result. You have to transform the instruction sequences of a typical register machine into this hybrid-machine JVM bytecode.
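
The register-to-stack transformation described above can be sketched with a naive translator. It assumes a made-up three-operand syntax (`op dst, src1, src2`) and maps register rN directly to local variable slot N; the emitted mnemonics (`iload`, `iadd`, `isub`, `imul`, `istore`) are real JVM instructions, but everything else here is a simplification for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RegToStack {
    // Register-machine opcode -> JVM int arithmetic instruction.
    private static final Map<String, String> OPS =
            Map.of("add", "iadd", "sub", "isub", "mul", "imul");

    // Translates one instruction such as "add r2, r0, r1" into stack-machine form:
    // push both sources, apply the operation, store the result.
    public static List<String> translate(String insn) {
        String[] parts = insn.trim().split("[\\s,]+");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 'op dst, src1, src2': " + insn);
        }
        String jvmOp = OPS.get(parts[0]);
        if (jvmOp == null) {
            throw new IllegalArgumentException("unsupported opcode: " + parts[0]);
        }
        List<String> out = new ArrayList<>();
        out.add("iload " + slot(parts[2]));
        out.add("iload " + slot(parts[3]));
        out.add(jvmOp);
        out.add("istore " + slot(parts[1]));
        return out;
    }

    // Assumption: register rN lives in local variable slot N.
    private static int slot(String reg) {
        if (!reg.startsWith("r")) {
            throw new IllegalArgumentException("not a register: " + reg);
        }
        return Integer.parseInt(reg.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(translate("add r2, r0, r1"));
        // prints [iload 0, iload 1, iadd, istore 2]
    }
}
```

One register instruction becomes four bytecode instructions here. A smarter translator would keep intermediate values on the operand stack instead of storing and reloading them, but that requires knowing which registers are still live.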

In conclusion, I think it's doable with a lot of time and effort, but note that this is about the simple case of apps that depend solely on libC. I'm not sure such a project is worth the human investment, but some interesting discoveries could be made along the way.

I assumed that a program that only depends on libC would have been compiled from a C program. If this is not the case, a general solution is probably not available. It's possible to write intentionally obfuscated assembly code to make a decompiler fail.
