
Code Breakdown: Thematic Roles

7 min read | by Yugen Omer Korat

In the previous article, we discussed thematic roles: what they are and how to use them to create a structured database out of free text. This article explains how to build such a database and use it in your application, focusing on the rationale behind the code rather than the technical details. For the fully documented code, see create_th_db.py and process_th_db.py in our ML examples repo.

create_th_db.py is responsible for turning text into events. Events are dictionaries that map each thematic role to its content (the phrase that takes that role) and map the field ‘Predicate’ to the main verb of the event. For a sentence like “John ate an apple”, the representation would be:

{
  "Predicate": "ate",
  "Agent": "John",
  "Theme": "an apple"
}

To achieve this, we use prompting with ChatGPT-4, where we ask it very politely to do this for us. Look at the actual prompt in prompts/system_thematic.jinja2 to see how it’s done.
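As a rough illustration, here is a minimal sketch of what that extraction step might look like, assuming the OpenAI chat API; the SYSTEM_PROMPT string is a hypothetical stand-in for the rendered prompts/system_thematic.jinja2 template, not the actual prompt:

import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical stand-in for the rendered jinja2 prompt.
SYSTEM_PROMPT = (
    "Extract the events in the user's text. Return a JSON list of objects, "
    "each mapping 'Predicate' and every thematic role (Agent, Theme, ...) "
    "to the phrase that fills it."
)

def extract_events(text):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(extract_events("John ate an apple."))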

In process_th_db.py, the event format is first converted from raw JSON into native Python. This happens in the lines preceding the creation of the raw_db variable, which stores a list of dictionaries, each corresponding to one event. It can now be used like any Python list of dictionaries, or converted into a dataframe for any purpose for which you would usually need tabular data. Missing thematic roles are simply mapped to None in this case.
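For illustration, a minimal sketch of that conversion, assuming the events were saved as a JSON list (the file name and role list are hypothetical placeholders; process_th_db.py may differ in its details):

import json
import pandas as pd

with open("events.json") as f:   # hypothetical file name
    raw_db = json.load(f)        # list of dicts, one per event

roles = ["Predicate", "Agent", "Theme"]   # plus any other roles you extract
df = pd.DataFrame(raw_db, columns=roles)
df = df.where(df.notna(), None)           # map missing roles to None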

There are lots of useful things you can do with this database. As I explained above, turning free text into any kind of tabular format was essentially impossible until recently, so the fact that ChatGPT-4 can do it consistently is quite an achievement. You can use it like any other dataframe, to perform logical operations and structured search as you would with an SQL database. For example, you could look for all events that describe a patient reporting symptoms of migraines, or a company announcing a new feature. So a lot of tasks that used to require pre-tabularized data can now be performed on free text. This is very useful, but there is little value in me showing you how to do it, since it’s just basic dataframe handling.

I will, however, elaborate on an especially noteworthy application of this technique: identifying the likelihood of events in the database relative to each other.

You might recognize this as the idea behind perplexity, one of the most basic ways to evaluate language models: a measure of how surprised (perplexed) the model is by each token in a given text. As a general rule, the less surprising actual data is to a model, the better the model. But while perplexity is defined over atomic tokens, our model’s basic unit is an entire event, which has its own internal structure and hierarchy. This means the likelihood is calculated over multiple channels, one for each thematic role, which adds an entire dimension: not just the likelihood of the phrases themselves, but also of the combinations between them. You can think of these representations as a form of hierarchical embeddings (with one level of hierarchy in this case, though we could go deeper if we wanted), with semantics-driven partitioning.
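For reference, token-level perplexity over a sequence of tokens w_1, ..., w_N is standardly defined as

\mathrm{PPL}(w_1,\dots,w_N) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_1,\dots,w_{i-1})\right)

The event likelihood we build below plays the analogous role, with role/cluster probabilities in place of the token probabilities.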

As is often the case in data science, the raw data has too much variation for our purposes. There are many phrases for which the distinction between them is almost meaningless for event likelihood. To take some actual examples from financial report text (which is what we currently work on at MarvinLabs), modest differences between numerical quantities like “5.012%” and “10.0%” are not indicative of any practically important distinction, because the phrases would be used in virtually the same contexts. You can also find many expressions that make distinctions far too specific and narrow to be of any interest, like “incremental revenue from TracFone” and “increases in wireless service revenue and Fios service revenue”. Yes, TracFone and Fios are different things, but for the purposes of language modeling this is simply not relevant, just like the difference between “John” and “Bill” when telling a story. And even though the phrasing differs, both phrases talk about a revenue increase from a service, so we want to capture the fact that they convey the same kind of idea.

The usual way to do this is by clustering the noun phrases (NPs) and predicates, which assigns the same label to phrases that are similar enough and thus dramatically simplifies the data for all applications. I have a future article planned about clustering if you want to dive deeper, but here I will give a high-level explanation that suffices for our purposes.

The first step in the clustering process is to collect most NPs and all predicates separately, and to embed each group using a language model, which gives us the variables nembs and pembs. In our example we use Huggingface's distilbert-base-uncased, one of the simplest yet most effective embedding models in standard use, though this can change depending on the application. This is done using the embed function.
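The repo's embed function may differ in its details, but a minimal sketch of the idea, mean-pooling DistilBERT's last hidden state over each phrase, could look like this (the phrase lists are placeholders):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(phrases):
    # Mean-pool the last hidden state over non-padding tokens.
    inputs = tokenizer(phrases, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state         # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

noun_phrases = ["John", "an apple", "incremental revenue from TracFone",
                "increases in wireless service revenue"]
predicates = ["ate", "announced", "reported"]
nembs = embed(noun_phrases)  # NP embeddings
pembs = embed(predicates)    # predicate embeddings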

The NP and predicate embeddings are then clustered separately, which is our way of simplifying them. In our example we use the KMeans clustering algorithm, with the number of clusters chosen rather arbitrarily (I simply picked values that I thought made conceptual sense; I’m not showing how to optimize it here because that strays too far from the core idea of this post).
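Continuing the sketch above, the clustering step can be as simple as the following (the cluster counts are placeholders you would pick to make conceptual sense for your own data):

from sklearn.cluster import KMeans

N_NP_CLUSTERS, N_PRED_CLUSTERS = 2, 2  # placeholders; tune for your data
np_clusters = KMeans(n_clusters=N_NP_CLUSTERS, random_state=0).fit_predict(nembs)
pred_clusters = KMeans(n_clusters=N_PRED_CLUSTERS, random_state=0).fit_predict(pembs)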

After mapping each NP or predicate to its corresponding cluster, we have a much smaller sample space over which to estimate event likelihood (to avoid confusion: the word “event” here happens to be both the mathematical term and the actual thing whose likelihood is being measured), which is generally great news. We are now ready for the last step, where we assign the actual likelihood based on the compressed event representations.
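To make the compression concrete, here is a hypothetical sketch that replaces each phrase in an event with its cluster id (np_label and pred_label are illustrative lookup tables built from the clustering step above, not names from the repo):

# Illustrative lookup tables from the clustering step above.
np_label = dict(zip(noun_phrases, np_clusters))
pred_label = dict(zip(predicates, pred_clusters))

def compress(event):
    # Replace each phrase with its cluster id; keep missing roles as None.
    return {role: None if phrase is None
            else (pred_label[phrase] if role == "Predicate" else np_label[phrase])
            for role, phrase in event.items()}

compressed_db = [compress(ev) for ev in raw_db]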

There are multiple ways to approach this, but for today let’s go with arguably the simplest one: for each thematic role, we calculate the probability of each cluster taking that role, and for each pair of thematic roles, we calculate the probability of each pair of clusters filling it. A probability is calculated by counting how many times a certain combination of role(s) and cluster(s) occurs, and dividing by the overall number of times an event had that role (or those roles). That is, out of all the times a certain combination had the chance to occur, how many times did it actually occur? We do this over the entire database using the function calculate_probabilities.
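The repo's calculate_probabilities may be organized differently, but a sketch of the counting logic just described could look like this:

from collections import Counter
from itertools import combinations

def calculate_probabilities(events):
    # events: list of dicts mapping each role to a cluster id (None if absent).
    single_counts, single_totals = Counter(), Counter()
    pair_counts, pair_totals = Counter(), Counter()
    for ev in events:
        filled = sorted((r, c) for r, c in ev.items() if c is not None)
        for role, cluster in filled:
            single_counts[(role, cluster)] += 1  # times this cluster took this role
            single_totals[role] += 1             # times this role appeared at all
        for (r1, c1), (r2, c2) in combinations(filled, 2):
            pair_counts[((r1, r2), (c1, c2))] += 1
            pair_totals[(r1, r2)] += 1
    single_probs = {k: n / single_totals[k[0]] for k, n in single_counts.items()}
    pair_probs = {k: n / pair_totals[k[0]] for k, n in pair_counts.items()}
    return single_probs, pair_probs

single_probs, pair_probs = calculate_probabilities(compressed_db)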

With this approach, the likelihood of each event is simply defined as the product of the probabilities of all of its role-and-cluster combinations. Again, an almost brain-dead approach, but it can give reasonable results with zero assumptions about the data. Of course, if you want to optimize this metric for your application, you will have to do a more in-depth analysis of key terms and their distribution in your dataset, and then weight the likelihood computation accordingly. That is a huge topic, far beyond the scope of this article, but we will get to it in future posts about applications of linguistic theory to topic modeling and the distribution of semantic fields.
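Under those definitions, the event likelihood reduces to a product over the two probability tables (again a sketch under the assumptions above, not the repo's exact code):

from itertools import combinations
from math import prod

def event_likelihood(event, single_probs, pair_probs):
    filled = sorted((r, c) for r, c in event.items() if c is not None)
    singles = [single_probs.get((r, c), 0.0) for r, c in filled]
    pairs = [pair_probs.get(((r1, r2), (c1, c2)), 0.0)
             for (r1, c1), (r2, c2) in combinations(filled, 2)]
    return prod(singles + pairs)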

As mentioned above, you can now use the likelihood of events for various purposes, as you would use perplexity, except the base unit is an entire event rather than a token. For example, you can pick the least likely events to find the surprising facts stored in your database, or you can train a model that predicts the likelihood of an event from its linguistic description, thus learning facts about a certain domain as encoded in the way people talk about it.

The whole series on AI in Linguistics

  1. Is ChatGPT really that big a deal? - Yes it is!
  2. How to turn free text into a structured SQL-like database using thematic roles
  3. Code Breakdown: Thematic Roles
  4. I Know What You Mean
  5. Game of Words, Episode I
  6. Game of Words, Episode II
by Yugen Omer Korat

Yugen is a co-founder and CTO of Marvin Labs. He was a postdoctoral researcher at the University of Iceland (Háskóli Íslands), holds a PhD in Computational Linguistics from Stanford University, and an MA and BA from Tel Aviv University.
