Analyzing the News with Kickstart AI

Recently, Kickstart AI hosted a workshop on using topic modeling to analyze the news. The workshop was titled "News Analysis for Poverty Insights in NL" and was held on 26 March at House of Watt in Amsterdam. I hosted and presented the workshop together with my colleagues from Kickstart AI, including Dr. Carmen Adriana Martinez Barbosa, Omendra Manhar, and Kim Veltman. The session delved into using AI for news analysis and what it can contribute to insights on poverty.

Why the news? Because we often view society through the news. It shapes the way we see the world, sways our voting choices, and impacts our mood. I see news as a goldmine of data that's yet to be tapped. Unlike typical datasets neatly organized in tables, news data is mostly unstructured. This complexity once made analysis tough. But now, thanks to advances in NLP, we can mine valuable insights from it. That's exactly what we wanted to explore. We worked with a dataset of 250,000 articles from the Dutch public broadcaster NOS, and showed how tools like TF-IDF and BERTopic can help discover themes and patterns in the news. We also linked this news data with macroeconomic perception indicators from CBS (Statistics Netherlands), exploring how the two could intersect.

This fits into a broader project we're working on, which we discussed in a previous blog post. In that project, we explore whether we can map poverty, in collaboration with Dutch food banks. Given the lack of frequent, detailed data on poverty, we were curious whether the news might offer a way to track changes in public perceptions of poverty.


Why Topic Modeling Matters

If you're wondering why we apply topic modeling, it's simple: it helps us sift through vast amounts of data when we don't know what we're looking for. For instance, with a standard dataset like the famous Iris flower dataset, you can use PCA to reduce the number of dimensions and then cluster the data. This reveals patterns and trends, even without prior knowledge of the data's specifics.

However, news articles aren't numbers, of course. They are texts. So to analyze them, we first need to convert these texts into numerical vectors. Once texts are vectors, we can treat and analyze news data like any other dataset. This allows us to measure how close different news articles are to one another, which opens up ways to see patterns and connections that aren't immediately obvious.

Here’s how it’s done:


Turning Text into Vectors

Bag-of-Words

One basic approach is the bag-of-words method, where we count how frequently each word appears in the texts. It's a useful starting point, and underpins several topic modeling techniques like LDA and NMF. However, it ignores the order of words and treats related word forms (such as "cat" and "cats") as distinct unless the text is pre-processed.

For example, imagine that we have a dataset that contains the following sentences:

Then we get the following bag-of-words representation:

w = ["a", "and", "cat", "cute", "dog", "is", "the"]

And the following frequencies:

In this simple example, bag-of-words works quite well. If you calculate the cosine similarity between these vectors, you'll see that sentences 1 and 2 are the closest to each other, and sentences 0 and 1 are the farthest apart. However, bag-of-words breaks down for sentences like these:

Bag-of-words gives both of these sentences the same representation, even though their meanings are diametrically opposed.
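To make the word-order problem concrete, here is a minimal Python sketch of bag-of-words and cosine similarity. The sentences are made up for illustration and are not the ones used in the workshop; the function names are hypothetical as well.

```python
import math
from collections import Counter

def bag_of_words(sentences):
    """Return the sorted vocabulary and one word-count vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative sentences (an assumption, not the workshop data).
sentences = [
    "the man bites the dog",
    "the dog bites the man",
    "the cat sleeps",
]
vocab, vectors = bag_of_words(sentences)
print(vocab)
for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)

# The first two sentences mean opposite things but share one vector.
print(vectors[0] == vectors[1])
print(round(cosine_similarity(vectors[0], vectors[2]), 3))
```

Because only counts are kept, swapping "man" and "dog" changes nothing in the vector, which is exactly the limitation described above.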


Encoder Large Language Models

Modern NLP offers more sophisticated tools. Encoder models like BERT are trained in a self-supervised way: words in a sentence are randomly masked, and the model has to predict them from the surrounding context. In doing so, the model learns about the semantics and structure of language, as well as some knowledge about the world. By averaging the vectors that such a model outputs for a text, we get a 'context vector' that encapsulates the essence of that text.
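The averaging step can be sketched in a few lines of Python. The numbers below are made up and stand in for the per-token vectors an encoder would return; real models output hundreds of dimensions. The padding mask is an assumption about how batches are usually prepared, not something specific to the workshop.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / n_real for total in totals]

# Made-up 3-dimensional token vectors for a three-token text plus padding.
token_vectors = [
    [0.2, 0.4, 0.0],
    [0.6, 0.0, 0.3],
    [0.1, 0.2, 0.9],
    [0.0, 0.0, 0.0],  # padding position, excluded from the average
]
attention_mask = [1, 1, 1, 0]

context_vector = mean_pool(token_vectors, attention_mask)
print(context_vector)
```

Two texts can then be compared by taking the cosine similarity of their context vectors, just like with bag-of-words vectors.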

The visualization above shows the attention maps of BERT's 9th layer. We can see that to predict the masked word 'Amsterdam', BERT pays special attention to the words 'capital' and 'Netherlands' in this layer. If you want to try out this visualization tool yourself, visit URL.


BERTopic in Action

Using encoder models, we can transform texts into context vectors, reduce their dimensionality, cluster them, and use TF-IDF to pinpoint what each cluster represents. This process, known as BERTopic, helps us explore and categorize large sets of news data effectively.
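The last step, describing each cluster with TF-IDF, is the easiest one to show in isolation. The sketch below assumes the embedding, dimensionality reduction, and clustering have already happened and that we are left with tokenized articles grouped per cluster. The weighting follows my reading of the class-based TF-IDF formula in the BERTopic documentation (term frequency within the cluster, multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters). The toy clusters and function names are hypothetical.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """Score terms per cluster; `clusters` maps a name to tokenized documents."""
    # Treat every cluster as one long document and count its terms.
    counts = {
        name: Counter(tok for doc in docs for tok in doc)
        for name, docs in clusters.items()
    }
    total_per_term = Counter()
    for cluster_counts in counts.values():
        total_per_term.update(cluster_counts)
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)

    scores = {}
    for name, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[name] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in cluster_counts.items()
        }
    return scores

def top_terms(scores, k=2):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        name: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for name, s in scores.items()
    }

# Toy clusters, invented for illustration.
clusters = {
    "cluster_0": [["news", "train", "delay", "station"],
                  ["news", "train", "strike", "station"]],
    "cluster_1": [["news", "gas", "price", "energy"],
                  ["news", "gas", "price", "bill"]],
}
scores = class_tfidf(clusters)
print(top_terms(scores))
```

A word like "news" appears in every cluster, so it scores lower than words that are concentrated in a single cluster, which is what makes the top terms usable as topic labels.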

In the workshop's first exercise, we focused on understanding media perceptions of poverty over the years. This topic is particularly relevant given our collaboration with the Food Bank. Using TF-IDF, we pinpointed key terms uniquely associated with articles that mentioned 'poverty' each year. For instance, in 2022, terms related to high electricity and gas prices were prominent, highlighting the economic effects of the Ukraine war on Dutch households. This exercise illustrated how a simple technique like TF-IDF can still yield interesting insights.

The audience engaged actively with this exercise and suggested ways to improve the results, such as applying preprocessing techniques like lemmatization. This led to a lively discussion about why LLMs often skip these preprocessing steps, which added an extra angle to our discussions on NLP.


The keywords for 2022 visualized and weighted by their TF-IDF scores.


Exploring the Embedding Space Created by LLMs

Our second exercise at the workshop explored the dimensionally reduced embedding space created by an LLM, using OpenAI's new text embedding model. We visualized how news articles about distinct topics, such as cars and trains, can cluster close to articles about buses. This hints that, conceptually, buses sit between cars and trains.

This visualization sparked some nice discussions about the pre-training of models, and how this process manages or fails to capture nuances that align with human intuition. These discussions seemed to be of interest to everyone, even those not well-versed in NLP.

Above you can see a visualization of the reduced embedding space of BERTopic. It consists of a 2D map of dense clusters of points, where each point represents an article, and each cluster represents a topic. Here, we zoom in on articles about car accidents and trains, which are grouped closely together. Where the two clusters intersect, you will find articles about both cars and trains or, in this case, also about buses.


Visualizing Geospatial Distribution of News Topics

The third exercise at the workshop involved visualizing the geospatial distribution of news topics using a kernel density plot. This was done using the coordinates derived from Named Entity Recognition (NER). We noticed that many topics had peaks around Utrecht, and that a lot of clusters centered around densely populated areas. This included The Hague, where the Dutch government is situated, making it a frequent metonym for the government in news articles.

Other topics, like news articles covering the Groningen gas drilling crisis, were more localized. During this exercise, some attentive audience members pointed out an anomaly: many topics were peaking around Utrecht. This was because we had assigned a latitude and longitude in the center of the country, near Utrecht, to articles that could only be located at the country level. Since Utrecht is in the geographical center of the Netherlands, it became the default location for all these articles without a more specific place. It was great to see the audience being so sharp. Being humbled by clever points is a good reminder of the importance of detail in data analysis.
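The Utrecht effect is easy to reproduce with a toy example. The sketch below implements a basic Gaussian kernel density estimate and compares the density at the fallback location with and without the country-level articles. Everything here is an assumption made for illustration: the coordinates are approximate, the record fields ("coords", "level") are hypothetical, and treating latitude and longitude as flat coordinates is a simplification that a proper geospatial analysis would avoid.

```python
import math

# Hypothetical fallback coordinate for articles located only at country level.
COUNTRY_FALLBACK = (52.1, 5.2)

def gaussian_kde_2d(points, query, bandwidth=0.2):
    """Density at `query` as the mean of isotropic Gaussian kernels on `points`."""
    if not points:
        return 0.0
    norm = 1.0 / (2 * math.pi * bandwidth ** 2)
    total = 0.0
    for lat, lon in points:
        sq_dist = (lat - query[0]) ** 2 + (lon - query[1]) ** 2
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return norm * total / len(points)

# Made-up articles: three near Groningen, four with only a country-level location.
articles = [
    {"coords": (53.22, 6.57), "level": "place"},
    {"coords": (53.25, 6.60), "level": "place"},
    {"coords": (53.20, 6.55), "level": "place"},
] + [{"coords": COUNTRY_FALLBACK, "level": "country"} for _ in range(4)]

all_points = [a["coords"] for a in articles]
place_points = [a["coords"] for a in articles if a["level"] == "place"]

density_with = gaussian_kde_2d(all_points, COUNTRY_FALLBACK)
density_without = gaussian_kde_2d(place_points, COUNTRY_FALLBACK)
print(density_with, density_without)
```

With the country-level articles included, the fallback point shows a clear peak; once they are filtered out, the peak disappears and only the genuinely local hotspot remains.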

Above you can see the geographic hotspots of a topic from NOS. Can you guess what this topic is? Hint: Consider the hotspots in Rotterdam and North Brabant.


Time-Series Comparison

In our fourth exercise at the workshop, we plotted the relative frequency of topics over time, and compared these trends with existing CBS data on socioeconomic perceptions. This involved calculating the frequency of topics each month. Then, we analyzed them against economic indicators like consumer confidence and spending willingness. We found some intriguing patterns, which participants with an economics background found particularly interesting. This led to a discussion on how traditional time-series analysis could be enhanced with news data, and how to improve the validity of time-series analysis. We are particularly interested in these techniques as they may help us improve our poverty mapping tool.

The chart above shows CBS consumer trust data for the next 12 months. The trust data is inverted and shown in orange. The chart also shows mentions of a topic about mortgages and homes, which are in blue. For some periods, the inverse relationship between the two series is quite clear. In other words, when news mentions of mortgages and housing peak, consumers have a low degree of trust in the economy.
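The comparison boils down to two steps: turn topic assignments into a monthly share, then compare that share with the indicator. Here is a minimal sketch of both steps. The articles and indicator values are invented, the function names are hypothetical, and a plain Pearson correlation is used as a stand-in for whatever comparison you prefer.

```python
import math
from collections import Counter

def monthly_share(articles, topic):
    """Fraction of each month's articles that were assigned to `topic`."""
    totals = Counter(month for month, _ in articles)
    hits = Counter(month for month, t in articles if t == topic)
    return {month: hits[month] / totals[month] for month in sorted(totals)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Made-up (month, topic) pairs; in practice these come from the topic model.
articles = [
    ("2022-01", "housing"), ("2022-01", "sports"),
    ("2022-01", "sports"), ("2022-01", "politics"),
    ("2022-02", "housing"), ("2022-02", "housing"),
    ("2022-02", "sports"), ("2022-02", "politics"),
    ("2022-03", "housing"), ("2022-03", "housing"),
    ("2022-03", "housing"), ("2022-03", "sports"),
]
# Invented indicator values per month (not real CBS figures).
confidence = {"2022-01": -10, "2022-02": -20, "2022-03": -30}

share = monthly_share(articles, "housing")
months = sorted(share)
r = pearson([share[m] for m in months], [confidence[m] for m in months])
print(share, round(r, 3))
```

Using a share rather than a raw count corrects for months in which simply more articles were published. Keep in mind that two series that both trend over time can correlate strongly without being related, so a coefficient like this is a starting point for discussion rather than evidence on its own.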

In a more involved exercise, we compared daily political polls with the news dataset, using TF-IDF to track how often terms linked to political figures, like 'Wilders', appeared and how this correlated with shifts in poll numbers. We found that political parties consistently gained or lost voters in weeks when Wilders was mentioned in news articles associated with them. This is in line with the widely held belief that Wilders has a significant influence on the political debate. It was nice to see how a simple technique like TF-IDF can provide interesting insights on this.


Final Thoughts on the Workshop

To give some background on myself, I started playing around with news analysis as a student a few years ago. For one project, I tried to categorize the news and track these categories over time. However, I found that it's very difficult to define a detailed and comprehensive set of categories manually when you are working with thousands of articles. Luckily for me, BERTopic had recently been released. In this pre-GPT era, I remember being amazed at how effectively language models could be used to discover trends.

I was grateful to share some of that same excitement with other data scientists at the Kickstart AI workshop. This workshop was also a chance for our engineering team to learn from an audience of 40 data scientists, students and software developers, gathering insights that could help shape our approach to using news as a data source. After working with NLP and news data for so long, it is easy to lose the sense of wonder at the workings and possibilities of these techniques. So it was nice to see that the topic of the workshop was truly insightful for those new to the field. I'm looking forward to more Kickstart AI workshops in the future!

If you're curious about the workshop, like which topic stole the show in Rotterdam and North Brabant, just reach out to me on LinkedIn: Cascha van Wanrooij. I'll be happy to fill you in!