Compactness of Tokenized Programming Languages
LLMs are often used to process programming language source code, including analyzing, summarizing, and synthesizing it. Because space in the LLM’s context window is often limited, one might think it wise to choose a programming language that produces compact programs for such applications. However, our intuition about what a “compact” programming language is (e.g. J, APL) might not match what the model sees, because while we see characters, the model sees tokens. Languages like Java, while verbose, should be much more compressible, because their repeated keyword and whitespace patterns can be encoded efficiently.
As a quick experiment, I took the 50 most popular languages on RosettaCode, collected all problems that have solutions in every one of those top 50 languages (at the time of writing there were 87 such problems), and graphed their tokenization statistics using the cl100k_base tokenizer (used by GPT-4 and some other OpenAI models).
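To make the setup concrete, here is a minimal sketch of the tokenization step; it assumes the tiktoken library is installed and skips the RosettaCode scraping entirely (the Java snippet is just an illustrative stand-in for a solution):

```python
# pip install tiktoken
import tiktoken

# The tokenizer used by GPT-4 and several other OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

source = 'public static void main(String[] args) { System.out.println("Hello"); }'
tokens = enc.encode(source)

print(len(source), "characters ->", len(tokens), "tokens")
```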
First, I checked if my intuition about compression ratios is true. That is, if we define the chars-per-token metric of a source code block as len(source)/len(tokens), would languages like Java be more compressible? Indeed they are:
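The ratio itself is simple to compute. Here is a sketch, assuming the solutions have already been collected into a hypothetical `solutions` dict mapping each language name to its list of source strings:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chars_per_token(solutions: dict[str, list[str]]) -> dict[str, float]:
    """Aggregate chars-per-token ratio for each language,
    pooled over all of its solutions."""
    ratios = {}
    for language, sources in solutions.items():
        total_chars = sum(len(src) for src in sources)
        total_tokens = sum(len(enc.encode(src)) for src in sources)
        ratios[language] = total_chars / total_tokens
    return ratios

# Hypothetical input: {"Java": ["public class ...", ...], "J": ["...", ...], ...}
# Verbose languages such as Java should end up with the highest ratios.
```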
Given that, the second question was whether stereotypically “compact” languages remain compact once tokenized. Maybe the higher compressibility of verbose languages is enough to compensate for their verbosity? It is not impossible that this works similarly to a phenomenon in natural language, where different languages seem to convey information at roughly the same rate.
However, that does not seem to be the case. If for each language we take its solutions and compute the distribution of their tokenized lengths, the results are as follows:
So even after accounting for tokenization efficiency, languages that are compact to humans also seem to be compact to LLMs.
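The token-length distributions themselves can be summarized roughly like this, under the same assumptions as before (tiktoken plus a hypothetical `solutions` dict keyed by language):

```python
import statistics

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_length_summary(solutions: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Summarize the distribution of tokenized solution lengths per language."""
    summary = {}
    for language, sources in solutions.items():
        lengths = [len(enc.encode(src)) for src in sources]
        q1, median, q3 = statistics.quantiles(lengths, n=4)
        summary[language] = {"q1": q1, "median": median, "q3": q3}
    return summary

# Sorting languages by median tokens per solution gives the ranking discussed
# above; terse languages like J still need the fewest tokens per solution.
```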
(Caveat: of course, token density doesn’t necessarily imply high performance with current LLMs. First, the denser languages are the less popular ones, and obviously GPT-4 isn’t going to understand J well when it has seen very little J training data. But also, there’s a reason humans don’t write everything in APL: density makes programs hard to read. I don’t have quantifiable measurements, but I’d suspect that lower token density gives the attention layers more “slack” to do their work and capture relevant patterns.)