Publication (Journal article)
2026-04-01
The cost of language: tokenization as a metric of labor
Paolo Caffoni approaches tokenization through the epistemological problems of translation
Abstract
This study analyzes tokenization in Natural Language Processing (NLP) at the intersection of language, technology, and political economy. Recent studies show that subword tokenization in multilingual Large Language Models (LLMs) distributes processing costs unevenly across languages, as non-Latin scripts are disproportionately fragmented by statistical segmentation. Taking this disparity as a point of departure, the study reconceptualizes tokenization as more than a purely linguistic or computational problem. The analysis approaches tokenization through the epistemological problems of translation and situates it within a longer history of socio-technical infrastructures, from telegraphy to machine learning, that mediate between linguistic abstraction and regimes of social productivity. Rather than treating tokens as neutral technical units, it shows that they function as operational measures of linguistic value beyond meaning, enabling the quantification and automation of linguistic labor. Drawing on Marxist and structuralist accounts of linguistic value, and on postcolonial critiques of semiotic standardization, the study frames tokenization as both a technical abstraction and a metric in the social division of labor.
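One mechanism behind the disparity the abstract describes can be made concrete with a minimal sketch (an illustration, not the study's method or any specific model's tokenizer): byte-level BPE tokenizers fall back to one token per UTF-8 byte when a script is poorly covered by learned merges, so non-Latin text can fragment far more heavily than Latin text of comparable length.

```python
# Minimal sketch, assuming a worst-case byte-level fallback: every UTF-8 byte
# becomes its own token. Real tokenizers merge frequent byte sequences, but
# merges are learned mostly from Latin-script data, so the gap persists.

def byte_fallback_token_count(text: str) -> int:
    """Token count if each UTF-8 byte were a separate token (worst case)."""
    return len(text.encode("utf-8"))

english = "language"  # 8 characters, Latin script: 1 byte per character
hindi = "भाषा"        # 4 characters, Devanagari ("language"): 3 bytes each

print(byte_fallback_token_count(english))  # 8
print(byte_fallback_token_count(hindi))    # 12
```

Even though the Devanagari word is half the length in characters, it costs more under byte-level fallback; since API pricing and context limits are denominated in tokens, this asymmetry is what makes tokenization legible as a metric of cost.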
