Abstract

Bias in large visual models is not just a question of what is represented but of the logic of representation itself. ImageNet, for instance – one of the most popular previous-generation datasets – sees the world as a collection of singular, industrially manufactured consumer goods. But the mapping from training data to trained model is always messy and indirect. If we want to better understand the place of large visual models within our contemporary visual culture, we will therefore have to ask more difficult questions about the ideologies (and, it turns out, epistemologies) of the “black boxes” themselves.

In this talk, which is based on our forthcoming book, we follow Phil Agre’s claim that artificial intelligence is “philosophy underneath” and show how the “philosophy” of contemporary artificial intelligence is no longer to be found in the famous thought experiments of Turing and Searle but, surprisingly, in a long chain of historical attempts to compress (visual) knowledge – a chain that reaches from the first formalizations of vector mathematics and dependent probabilities in the 19th century, through the computational biology research of the 1980s, all the way to the multimodal models of the 2020s.

What we uncover in doing so is a significant and previously little understood technical paradigm shift in artificial intelligence research that continues to shape the ideological function of these models. Their capabilities, we argue, are closely tied to the rise of a specific machine learning technique called “embedding”, a technique that has so far not been studied from the perspective of critical artificial intelligence studies. Starting in the 1990s, embeddings are thought of, and implemented, as an abstract geometry for not only representing but also producing knowledge – a development that, surprisingly, is described by the computer scientists themselves as a post-structuralist turn. Vector mathematics – of the kind employed in the famous analogy query of natural language processing, “king − man + woman = ?” – is imagined as a mathematical tool to transcend the training data and reintroduce meaning into the embedding space. While, on the surface, it seems like technocratic notions of human intelligence determine the claim to power of artificial intelligence, it is in fact this epistemic shift that historically structures it. It is thus the epistemology of embedding that ends up shaping, even determining, the epistemology of artificial intelligence in general.
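To make the geometry of that analogy query concrete, here is a minimal sketch in Python. The vectors and the word list are invented toy values rather than learned embeddings, and the names (`embeddings`, `cosine`) are ours; in a learned space such as word2vec’s, the “answer” is simply the known vector nearest to the point reached by the arithmetic.

```python
# Minimal sketch of the analogy arithmetic described above, with invented toy
# 3-dimensional vectors (learned embedding spaces have hundreds of dimensions).
import numpy as np

# Hypothetical embedding table: word -> vector.
embeddings = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.3, 0.9, 0.1]),
    "woman": np.array([0.3, 0.1, 0.9]),
    "apple": np.array([0.1, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman = ?" is computed literally as vector arithmetic:
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The "answer" is whichever known vector lies closest to the resulting point,
# excluding the words that appear in the query itself.
candidates = {w: v for w, v in embeddings.items() if w not in ("king", "man", "woman")}
answer = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(answer)  # -> "queen" with these toy vectors
```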

We show how this technical-ideological turn in machine learning eventually leads to the multimodal models of today, models that seem to transcend the media boundaries of the objects they ingest. Text, image, and audio can all be represented by the same model, as just another embedding. We theorize this tendency towards media collapse – a centrifugal pull of commensurability that dissolves media-specific cultural objects into embeddings – as the rise of neural exchange value: value that specific cultural objects obtain once they become part of a multimodal embedding space. We conclude by proposing the study of embedding spaces, and their machine visual culture, as a necessary complement to the study of datasets.
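The pattern beneath this commensurability can be sketched schematically. The sketch below is not a description of any actual multimodal model; the two encoders are hypothetical random stand-ins and all names are ours. It only illustrates the structural move described above: text and image are projected into one shared vector space, where a single similarity measure applies to both.

```python
# Schematic sketch of "media collapse": once text and images land in the same
# vector space, a single similarity measure applies to both. The encoders are
# hypothetical random projections, NOT a real multimodal model; only the
# pattern -- encode, then compare in a shared space -- is the point.
import numpy as np

rng = np.random.default_rng(0)
SHARED_DIM = 8  # real models use hundreds or thousands of dimensions

# Stand-in "learned" projections into the shared space (randomly initialized here).
TEXT_PROJECTION = rng.standard_normal((100, SHARED_DIM))
IMAGE_PROJECTION = rng.standard_normal((100, SHARED_DIM))

def encode_text(token_ids: list[int]) -> np.ndarray:
    """Hypothetical text encoder: bag-of-tokens features projected into the shared space."""
    features = np.bincount(token_ids, minlength=100).astype(float)
    return features @ TEXT_PROJECTION

def encode_image(pixels: np.ndarray) -> np.ndarray:
    """Hypothetical image encoder: flattened pixel features projected into the same space."""
    features = pixels.flatten()[:100]
    return features @ IMAGE_PROJECTION

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the one measure under which all media become commensurable."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

caption = encode_text([3, 14, 15, 92])        # stands in for a tokenized caption
image = encode_image(rng.random((10, 10)))    # stands in for a 10x10 toy "image"
print(similarity(caption, image))             # one number, regardless of medium
```

In a trained multimodal model, the two projections are learned jointly so that matching captions and images land close together; here they are random, so the printed similarity shows only that, once everything is an embedding, the comparison can be made at all.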