Interpreted vs compiled languages

PHP, you are interpreted. Join the club!

There is some confusion out there about which high-level programming languages are interpreted and which are compiled. Every time someone mentions that PHP is interpreted, many PHP developers seem to feel a bit ashamed of it, mostly because they think that PHP is the only, or one of the few, programming languages out there that are interpreted.

Let’s put things in order, starting with some definitions:

1. To execute a program, we need to convert the source (programmer’s) code to machine language. That is, we need to convert the code from a high-level representation (source code) that humans can understand to a low-level one that the server’s CPU, the machine, can understand. This low-level representation is not the same for all CPUs; it depends on the architecture (x86, ARM, Power9, …). So, this low-level representation is not portable.

2. Compilation is the process that converts a piece of code into a lower-level representation, no matter when we are going to execute it. Generally speaking, this lower-level representation may be machine language or an intermediate representation. A compiler is the piece of software that makes that conversion. Interpretation is the process that converts a piece of code to machine language (no intermediate representation here) during execution. It converts a line of code, executes it, converts the next line, and so on. So, contrary to compilation, in interpretation the output of the conversion is always machine language. No ambiguity here. And, of course, an interpreter is the piece of software that performs the interpretation.

3. A “compiled language” is a programming language that officially provides a compiler to machine language. So, a program written in a compiled language can be converted into a binary executable file. This file can be directly executed by the CPU without any knowledge of the programming language, at all. By “officially”, we mean that this compiler is not experimental but it is stable, complete (it fully supports the language, not just a part of it) and it is widely used by the programmer community. An interpreted language doesn’t have such a compiler and it can only be converted into machine language, line-by-line, during execution by an interpreter.
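The difference between the two processes in definition 2 can be sketched with a toy example. This is only an illustration under loud assumptions: the stack language (push, add, mul, print) and every function name below are made up for this post, and the "lower-level representation" is a plain Python list rather than real machine language.

```python
def decode(line):
    """Convert one line of source text into an (op, arg) pair."""
    parts = line.split()
    return parts[0], int(parts[1]) if len(parts) > 1 else None


def execute(op, arg, stack, out):
    """Carry out one already-decoded instruction of the toy language."""
    if op == "push":
        stack.append(arg)
    elif op == "add":
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
    elif op == "mul":
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)
    elif op == "print":
        out.append(stack.pop())
    else:
        raise ValueError(f"unknown instruction: {op}")


def interpret(source):
    """Interpretation: convert a line, execute it, then move on to the next line."""
    stack, out = [], []
    for line in source.strip().splitlines():
        op, arg = decode(line)
        execute(op, arg, stack, out)
    return out


def compile_program(source):
    """Compilation: convert the whole program up front, without executing anything."""
    return [decode(line) for line in source.strip().splitlines()]


def run(compiled):
    """Execute a previously compiled program; the source text is no longer needed."""
    stack, out = [], []
    for op, arg in compiled:
        execute(op, arg, stack, out)
    return out


SOURCE = """
push 2
push 3
add
push 4
mul
print
"""

if __name__ == "__main__":
    print(interpret(SOURCE))            # conversion and execution interleaved
    program = compile_program(SOURCE)   # conversion only, at any time we like
    print(run(program))                 # execution only, possibly much later
```

Both paths compute (2 + 3) * 4. The only difference is when the conversion happens: inside the execution loop, or in a separate step whose result can be kept and executed later.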

Now, we have a good basis for discussion. We move on.

As you may have guessed, whether a programming language is a compiled language or an interpreted one is not an inherent property of the language. We could build a compiler or an interpreter for any language. How easy this is, or whether it is worth building for each language, is another story. However, based on our definitions, C and C++ are considered compiled languages. PHP, Java, Python and Ruby are considered interpreted languages. We will justify that in a second.

There is a popular categorization of machine language compilers. The first category is Ahead-Of-Time (AOT) compilers and the second is Just-In-Time (JIT) compilers. AOT compilers can be used to convert the code to machine language (to a binary executable) at any time. So, ahead of (execution) time. Much ahead! On the other hand, JIT compilers work hand in hand with interpreters. They analyze execution statistics, during execution, and when they find pieces of code that are being executed much more frequently than others, they compile them and keep the compiled (machine) code in memory in order to speed up execution. [1] The argument in favor of JIT and against AOT is that “JIT compilation process allows for all sorts of optimizations that cannot be made in a statically compiled binary, thus enabling higher performance.” [2] On the other hand, a JIT compiler has a cold-start problem. It is a trade-off. [3]
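The "count executions, then compile the hot pieces and keep them in memory" idea can be sketched in a few lines. This is a simulation under stated assumptions, not how any real JIT is implemented: the class name TinyJit and the threshold HOT_THRESHOLD are hypothetical, and Python's built-in compile() produces CPython bytecode, which here merely stands in for machine code.

```python
HOT_THRESHOLD = 3  # hypothetical: calls needed before a piece of code counts as "hot"


class TinyJit:
    """Evaluates an expression in x, switching to a compiled version once it is hot."""

    def __init__(self, expression):
        self.expression = expression  # source text, e.g. "x * x + 1"
        self.calls = 0                # the execution statistic we collect
        self.compiled = None          # compiled version kept in memory, once built

    def __call__(self, x):
        self.calls += 1
        if self.compiled is not None:
            # Fast path: reuse the version compiled earlier.
            return self.compiled(x)
        if self.calls >= HOT_THRESHOLD:
            # Hot: pay the compilation cost once and keep the result in memory.
            code = compile(f"lambda x: {self.expression}", "<jit>", "eval")
            self.compiled = eval(code)
            return self.compiled(x)
        # Cold path: parse and evaluate the source text again on every call.
        return eval(self.expression, {"x": x})


if __name__ == "__main__":
    square_plus_one = TinyJit("x * x + 1")
    for value in range(5):
        state = "compiled" if square_plus_one.compiled else "interpreted"
        print(value, square_plus_one(value), state)
```

The first calls take the slow path, which is the cold-start cost mentioned above. Only after the threshold is reached does the compiled version take over.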

Java is an interpreted language. However, it has a compiler (javac) that converts the source code into an intermediate binary format called “bytecodes”. “This byte-code runs on the Java Virtual Machine (JVM), which is usually a software-based interpreter. The use of compiled byte-code allows the interpreter (the virtual machine) to be small and efficient (and nearly as fast as the CPU running native, compiled code). In addition, this byte-code gives Java its portability: it will run on any JVM that is correctly implemented, regardless of computer hardware or software configuration.” [4] It wouldn’t be far from the truth to say that Java is both a compiled and an interpreted language (which many claim in an effort to explain how things work, not to categorize the language), but if we need to categorize it, it remains an interpreted one.

PHP is similar. It has a compiler that converts source code into an intermediate binary format called “opcodes”. These opcodes run on the Zend Engine (the PHP virtual machine [5]). However, there is a difference here compared to Java. In Java, the conversion to bytecodes happens independently of the execution. In PHP, it happens during execution and the opcodes are kept in memory and not exported to a file. The reason is that, while there are tools to produce and export (save into a file) opcodes for a specific PHP file [6], these opcode files are not portable. According to Nikita Popov [7], “PHP uses the ‘system ID’ to determine whether it is possible to reuse compiled opcodes. This system ID is used by the existing file cache.” This ties the opcode files to the specific PHP patch release that was used to produce them (and probably to the loaded extensions, too). This does not happen in .NET or Java, where bytecodes (and so, the JVM) are designed to be backward compatible and, therefore, portable. The situation in PHP is more similar to the one in Python.
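Since the closest comparison is Python, a short sketch on CPython can show both halves of the picture: source code is compiled to an intermediate representation, and the cached form of that representation is tied to the interpreter that produced it. The function area and the file name example.py are made up for illustration; dis, importlib.util and sys are standard library modules. The sketch assumes CPython, and the instruction names it prints differ between Python versions, which is the point.

```python
import dis
import importlib.util
import sys


def area(width, height):
    """A made-up function whose compiled form we want to look at."""
    return width * height


# By the time we get here, the function body has already been compiled to
# bytecode. The instruction names are an implementation detail and change
# between interpreter versions.
instructions = [ins.opname for ins in dis.get_instructions(area)]

# Where the import system would cache the bytecode of a (hypothetical) file
# named example.py. The path embeds a tag naming the interpreter and version.
cache_path = importlib.util.cache_from_source("example.py")

if __name__ == "__main__":
    print(instructions)
    print(cache_path)
    print(sys.implementation.cache_tag)
    # Every cached bytecode file starts with this marker; the interpreter
    # recompiles from source when the marker does not match its own.
    print(importlib.util.MAGIC_NUMBER)
```

Running the same script under two different Python versions should print different cache paths and, typically, different instruction lists. That is the Python counterpart of PHP tying cached opcodes to a system ID.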

PHP, Java, Ruby and Python all have a JIT compiler available. For PHP, this is a recent addition (version 8). Java has had one since 1996. Python’s main/official interpreter, CPython[8], does not use a JIT compiler. However, Python has another popular interpreter, PyPy[9], which has had a JIT compiler since 2007 and, for this reason, is much faster than CPython. Ruby has had a JIT compiler since 2018 (version 2.6).

PHP has no AOT compiler, nor are there plans to add one. HipHop, developed by Facebook and now discontinued, was not a compiler but a transpiler from PHP to C++. An AOT compiler (jaotc) was added in Java 9 as an experimental feature. Still, it is used as a second step, compiling the bytecode, not the source code, to machine language. [10] However, it seems that the AOT compiler has not been widely used by the Java community and there are thoughts of removing it from the JDK. [11] Python seems to provide an AOT compiler (via Numba) alongside its JIT compiler, but the use of AOT compilation comes with some limitations. [12] Among others, “AOT compilation produces generic code for your CPU’s architectural family (for example “x86-64”), while JIT compilation produces code optimized for your particular CPU model”. [13] There is also ShedSkin for Python, but it is actually a transpiler. It converts Python programs to optimized C++ and it is still considered experimental. [14] Ruby doesn’t have an AOT compiler either. Just recently, an experimental one has been announced. [15]

References:

[1]. https://www.youtube.com/watch?v=sJVenujWGjs
[2]. https://hhvm.com/
[3]. https://en.wikipedia.org/wiki/Ahead-of-time_compilation
[4]. https://www.sciencedirect.com/topics/engineering/interpreted-language
[5]. https://www.zend.com/blog/exploring-new-php-jit-compiler
[6]. https://php.watch/articles/php-dump-opcodes
[7]. https://externals.io/message/111965
[8]. https://cython.org/
[9]. https://en.wikipedia.org/wiki/PyPy
[10]. https://www.baeldung.com/ahead-of-time-compilation
[11]. https://openjdk.java.net/jeps/410
[12]. https://csrgxtu.github.io/2020/02/09/Lift-Your-Python-Speed/
[13]. http://numba.pydata.org/numba-doc/latest/user/pycc.html
[14]. https://code.google.com/archive/p/shedskin/
[15]. https://sorbet.org/blog/2021/07/30/open-sourcing-sorbet-compiler