AI Linguistics

How to turn free text into a structured SQL-like database using thematic roles

Dec 12, 2023 | 4 min read | by Yugen Omer Korat

People often ask me if a background in theoretical linguistics translates into any practical utility for language-based AI development. My usual answer is that it’s hard to tell, and there are opinions both ways. Even if it does translate, it would often come in subtle ways that are almost impossible to measure.

There are rare times, though, when the contribution of linguistic theory to AI practice is undeniable, and the application of thematic roles to information retrieval is one of them.

As we’ve established in the previous article, natural language is messy. Since the days of Aristotle, people have been looking for sets of rules that embed (no pun intended) natural language in a well-defined, logical space. Until ChatGPT came out, no such attempt had quite succeeded in capturing all the cognitive, social, and rhetorical nuances in a way that yields natural-sounding, artificially generated language.

As a result, information retrieval systems had to make do with proxies. When the user types a query, you embed it into some high-dimensional space and then retrieve the documents with the geometrically closest embeddings, hoping that there will be a connection. This is useful in many cases, but it is very far from a logical SQL-like database in which every sentence is mapped to a single, maximally compressed meaning, as you can do with tabular data.
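To make the proxy concrete, here is a minimal sketch of nearest-neighbor retrieval by cosine similarity. The three-dimensional vectors are invented for illustration (real embeddings come from a model and have hundreds or thousands of dimensions), and the names `cosine`, `nearest`, and `DOCS` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query_vec, doc_vecs, k=1):
    """Return the k document keys whose vectors are closest to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings", made up by hand; a real system would get these from a model.
DOCS = {
    "doc_about_fruit": [0.9, 0.1, 0.2],
    "doc_about_eating": [0.8, 0.2, 0.3],
    "doc_about_stocks": [0.1, 0.9, 0.4],
}

query = [0.85, 0.15, 0.25]  # pretend this is the embedded user query
top_two = nearest(query, DOCS, k=2)
print(top_two)
```

Note that the result is only a ranking by geometric closeness. Nothing in it says who did what to whom, which is the gap the rest of this article is about.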

What tabular data has that natural language doesn’t is structure: the columns tell us the role of each component in the row. And with modern chatbots, we can now give such structure to free text with a precision that had not been possible prior to the Intelligent Age. And there is a particular theory in linguistics that lends itself quite naturally to the assignment of such structure: Thematic Role Theory!

We all know that sentences have different components, like verb, subject, direct object, indirect object, etc. However, these labels are defined syntactically. In English, the subject always appears first, followed by the verb, and then by the objects (at least canonically; let’s not open that can of worms). Different languages have different orderings, which depend on their syntax.

However, what syntax doesn’t tell us is the role of a certain component in the event described by the sentence (=its semantics/meaning). The classic example is in passive verbs:

John ate the apple
The apple was eaten by John

Both of these sentences describe the same event (=have the same semantics/meaning), but their subjects and objects are inverted. So we need terminology to describe what John is and what the apple is, in a way that doesn’t vary between these two sentences. And this is what thematic roles are for.

We say that John is the “Agent” of the sentence, because he is the one doing the action, and the apple is the “Theme”, which means it is the entity being acted upon. There are other labels for almost any role you can imagine an entity taking in a sentence, like Patient, Instrument, Goal, etc. By assigning these roles to phrases we are essentially tabularizing linguistic data, which can potentially lead us to the logical SQL-like database mentioned above.
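As a small illustration of what “tabularizing” means here, the sketch below writes the role assignments for the two example sentences by hand (no parser or model is involved) and checks that the active and passive versions collapse into the same row. The `Frame` class and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Frame:
    """One event, with a column per thematic role (a hypothetical schema)."""
    predicate: str
    agent: Optional[str] = None
    theme: Optional[str] = None
    instrument: Optional[str] = None

# Hand-written annotations for the two sentences from the text.
ANNOTATED = {
    "John ate the apple": Frame(predicate="eat", agent="John", theme="the apple"),
    "The apple was eaten by John": Frame(predicate="eat", agent="John", theme="the apple"),
}

active = ANNOTATED["John ate the apple"]
passive = ANNOTATED["The apple was eaten by John"]

# Different syntax, same row: dataclass equality compares field by field.
print(active == passive)  # True
```

The subject and object swap between the two sentences, but the Agent and Theme columns do not, which is exactly the invariance we want from a database row.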

And if you’re still with me, you can tell that this is exactly the kind of task that used to be impossible and is now quite easy. It used to be impossible because no amount of logical formulation suffices to automatically infer the participants and their roles from arbitrary text. Beyond just linguistic faculties, the task requires substantial reasoning and world knowledge, which is the leap made by ChatGPT. And of course, it can do it in a fraction of the time and at a fraction of the cost it would take a human annotator.

Just to give you a sense of how difficult this task is: based on my testing, even GPT-3.5 performs horrendously at it, and the jump to GPT-4 is necessary here, which says a lot by itself. So of course, in the olden days, it was unimaginable.

Anyway, long story short, we’ve written a prompt that takes free text and generates thematic role representations that look like the title image of this article, and a script that converts the result into a structured database. You can take a look here, and in the next article we will publish a code breakdown that goes through the details. We will also show how to use this database to assign likelihoods to events in a way that is more nuanced than what you can do with straight-out language models.
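Until that breakdown is out, here is a rough sketch of what the conversion step could look like. It is not our actual script: the JSON shape in `MODEL_OUTPUT`, the `events` table, and the `build_db` helper are all assumptions made up for illustration, and the role annotations are written by hand rather than produced by a model.

```python
import json
import sqlite3

# Hypothetical model output: one JSON object per event, keyed by role name.
MODEL_OUTPUT = """
[
  {"predicate": "eat", "Agent": "John", "Theme": "the apple"},
  {"predicate": "open", "Agent": "Mary", "Theme": "the door", "Instrument": "a key"}
]
"""

def build_db(raw_json):
    """Load role-annotated events into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, predicate TEXT NOT NULL, "
        "agent TEXT, theme TEXT, instrument TEXT)"
    )
    for event in json.loads(raw_json):
        # Roles missing from an event are stored as NULL.
        conn.execute(
            "INSERT INTO events (predicate, agent, theme, instrument) "
            "VALUES (?, ?, ?, ?)",
            (
                event["predicate"],
                event.get("Agent"),
                event.get("Theme"),
                event.get("Instrument"),
            ),
        )
    conn.commit()
    return conn

conn = build_db(MODEL_OUTPUT)

# "What did John act on, and how?" becomes an ordinary SQL query.
rows = conn.execute(
    "SELECT predicate, theme FROM events WHERE agent = ?", ("John",)
).fetchall()
print(rows)  # [('eat', 'the apple')]
```

One column per role keeps the queries simple. A schema with one row per (event, role, filler) triple would be an alternative if the set of roles is open-ended.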

The whole series on AI in Linguistics


Yugen is a co-founder and CTO of Marvin Labs. He was a postdoctoral researcher at Háskóli Íslands (the University of Iceland), and holds a PhD in Computational Linguistics from Stanford University and an MA and a BA from Tel Aviv University.