Japanese is the most expensive language in terms of input tokens

Tags: til, openai, tiktoken
Published: June 27, 2025
Modified: July 3, 2025

OpenAI mentions in their documentation that 1 token corresponds to roughly 4 characters of English text.
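
As a quick sanity check of that rule of thumb, here's a minimal snippet (the sample sentence is just an arbitrary string I picked, not from the article):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")

sample = "Tokens are chunks of text, not individual characters."
tokens = encoding.encode(sample)

# For typical English prose this tends to land somewhere around 4
print(f"{len(sample) / len(tokens):.2f} chars per token")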

I was curious how this would work for different languages, so:

  1. I took a small section of Paul Graham’s How to Do Great Work
  2. Translated it into 8 other languages: Spanish, French, German, Japanese, Chinese, Hindi, Russian, and Portuguese
  3. Counted the tokens and characters in each version
  4. Compared the results.

Code

Here’s the code:

import tiktoken

LANGS = ["en", "es", "fr", "de", "jp", "zh", "hi", "ru", "pt"]

def read_text(file_path):
    # Read as UTF-8 explicitly so the non-Latin scripts survive on
    # platforms where the default encoding isn't UTF-8
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()

texts = {lang: read_text(f"../_extras/counting-tokens/{lang}.md") for lang in LANGS}

# Build the encoder once instead of on every call
encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    return len(encoding.encode(text))

chars_count = {lang: len(text) for lang, text in texts.items()}
tokens_count = {lang: count_tokens(text) for lang, text in texts.items()}

This reads the text from each file and uses tiktoken to count its tokens. I also counted the number of characters in each text.

Then I calculated the ratio of characters to tokens for each language.

for lang in LANGS:
    chars = chars_count[lang]
    tokens = tokens_count[lang]
    print(f"{lang}: {chars / tokens:.2f} chars per token, {chars} chars, {tokens} tokens")

en: 4.75 chars per token, 2053 chars, 432 tokens
es: 4.56 chars per token, 2271 chars, 498 tokens
fr: 4.69 chars per token, 2689 chars, 573 tokens
de: 4.46 chars per token, 2479 chars, 556 tokens
jp: 1.41 chars per token, 1081 chars, 767 tokens
zh: 1.33 chars per token, 707 chars, 531 tokens
hi: 3.51 chars per token, 2194 chars, 625 tokens
ru: 4.02 chars per token, 2275 chars, 566 tokens
pt: 4.63 chars per token, 2200 chars, 475 tokens

Results

Here’s what I found interesting:

  • English is the most efficient language, at 4.75 characters per token.
  • Mandarin Chinese (1.33 characters per token) is the least efficient, followed by Japanese (1.41 characters per token).
  • The same text in Japanese uses about 78% more tokens than in English (767 vs. 432), making it the most expensive language in terms of input tokens (see the sketch after this list).
  • Even though Chinese is less efficient than Japanese in characters per token, it conveys more information per character: the article took 2053 characters in English, 707 in Chinese, and 1081 in Japanese. That’s why Chinese isn’t also the most expensive language.
  • Languages written in the Latin script (English, Spanish, French, German, Portuguese) are more efficient than languages written in other scripts (Japanese, Chinese, Hindi, Russian). Russian is the most efficient of the non-Latin group, at 4.02 characters per token.
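
To make the cost angle concrete, here's a small follow-up sketch that reuses the tokens_count dict from above. Since input cost scales linearly with token count, the ratio to English is the relative price of sending the same text:

# Relative input token usage vs. English, from the counts measured above
baseline = tokens_count["en"]
for lang in sorted(tokens_count, key=tokens_count.get):
    print(f"{lang}: {tokens_count[lang] / baseline:.2f}x English tokens")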

Limitations

This analysis has some clear limitations:

  1. The text might not be representative of the kinds of texts you’re working with.
  2. The translations might not be good enough to truly reflect the information conveyed per character.

Citation

BibTeX citation:
@online{castillo2025,
  author = {Castillo, Dylan},
  title = {Japanese Is the Most Expensive Language in Terms of Input
    Tokens},
  date = {2025-06-27},
  url = {https://dylancastillo.co/til/counting-tokens.html},
  langid = {en}
}
For attribution, please cite this work as:
Castillo, Dylan. 2025. “Japanese Is the Most Expensive Language in Terms of Input Tokens.” June 27, 2025. https://dylancastillo.co/til/counting-tokens.html.