Analysis and experiments on the UN General Debate corpus

UN General Debates Analysis

This repo is a collection of experiments on historical speeches made at the UN General Debate, an annual forum where world leaders discuss issues affecting the international community. The speeches form a historical record of those issues and make an interesting corpus for analyzing how this narrative has evolved over time.

In particular, I implement and apply various NLP methods for topic modelling, discovery, and interpretation, and study how to model changes in topics over time.

The full text of these speeches was compiled and cleaned by researchers in the UK and Ireland, who used this data to study the position of different countries on various policy dimensions. See more info here.

Setup

The structure of this project is based on the cookiecutter data science project template. If you would like to run or extend this code yourself, follow the steps below for setting up a local development environment.

Environment

Create the environment with:

make create_environment

This will create a Python 3 virtualenv or conda environment (if conda is available). Next, activate the environment. With conda (which I use), this is:

source activate un-general-debates

Finally, install the requirements with the following command. Note that it uses pip behind the scenes, and some packages are not easily pip installable, so it won't install everything. See the commented section in requirements.txt for packages that you should install manually.

make requirements

Data

Two raw data files are used in this project.

This project also uses pretrained Wikipedia2vec vectors for word and entity (Wikipedia page) embeddings. See here for more details.

The following will download all of the raw data and preprocess the appropriate files for you. If you haven't used the Kaggle API before, some additional setup will be required for this to work.

make data

Preprocessed data files are written to data/processed/.

Methods

Paragraph Tokenization

A key observation in this dataset is that each speech discusses a multitude of topics. If every speech touches on both poverty and terrorism, for example, a topic model such as LDA trained on entire speeches as documents has no way of learning that terms like "poverty" and "terrorism" should represent different topics.

To counter this problem, I tokenize each speech into paragraphs and treat each paragraph as a separate document for analysis. A simple rule-based approach that splits on sentences separated by a newline character performs reasonably well for this dataset. After this step, the number of documents jumps from around 7,500 (full speeches) to nearly 300,000 (paragraphs).
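The rule-based splitting can be sketched as follows. This is a minimal illustration, not the repo's exact implementation; the 40-character minimum length is an assumed threshold for filtering out headings and fragments.

```python
import re

def tokenize_paragraphs(speech: str) -> list[str]:
    """Split a speech into paragraphs on newline boundaries,
    dropping fragments too short to be a real paragraph."""
    candidates = re.split(r"\n+", speech)
    return [p.strip() for p in candidates if len(p.strip()) > 40]

speech = (
    "Poverty remains a pressing concern for the developing world.\n"
    "Terrorism, meanwhile, threatens the security of all nations everywhere."
)
paragraphs = tokenize_paragraphs(speech)
```

Each element of `paragraphs` then becomes its own document for the topic models below.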

Topic Modelling

LDA

Applying LDA with gensim and visualizing resulting topics with pyLDAvis reveals some easily interpretable topics such as nuclear weapons, Africa, and Israel/Palestine. See notebooks/LDA.ipynb.

Dynamic Topic Modelling

A Dynamic Topic Model [1] extends LDA so that topic representations can evolve over fixed time intervals such as years. I wrote about applying this method here. As an example, the model learned a topic about "Human Rights"; a plot of term probabilities over time within this topic is shown below. Note the rising use of "woman" and "gender", the decline of "man", and the inverse relationship between "mankind" and "humankind".

Human Rights Topic Probabilities

This code uses gensim's wrapper to the original C++ implementation to train DTMs. See the docs for instructions on setup. You will need to either download a precompiled binary or build one manually.

To train a DTM on this dataset, refer to src/models/dtm.py. Note that inference takes quite a while: almost 8 hours for me on an n1-standard-2 (2 vCPUs, 7.5 GB memory) instance on Google Cloud Platform. The script will save the model and a copy of the processed data into models/, and you can use the notebook notebooks/DTM.ipynb to explore the learned topics.

Representing Topics

Topics are typically represented by a list of the top N terms by conditional probability within the topic. However, studies have shown that introducing a measure of exclusivity into this ranking can aid interpretation [5][7]. Intuitively, we want terms that are not only prevalent in the given topic but also fairly exclusive to it: terms with a large gap between their conditional probability in the topic and their marginal probability across the corpus. I implement the simple relevance weighting scheme described in [5].

Topic Labelling

The topic models above represent topics as ranked lists of terms. Interpreting topics through lists of individual words carries a high cognitive overhead even with the term reranking scheme mentioned above, so several methods have been developed for labelling topics in a more human-friendly way, such as [2] and [3]. These methods use chunking/noun phrases, frequency statistics, Wikipedia titles/concepts, and more to generate descriptive phrases for topics.

Taking inspiration from [2] especially, I develop a method for labelling topics leveraging pretrained word and entity (Wikipedia title) embeddings from Wikipedia2vec [4]. Specifically, my method consists of the following steps to label a given topic:

  1. For each document assigned to the topic (according to the document's highest topic probability), extract all noun chunks and match each to a corresponding Wikipedia entity (if one exists). Take the top N entities (based on the differential between frequency within and outside of the topic) as label candidates for the next step.

  2. For each label candidate from 1), compute the mean similarity between its embedding and that of each word in the term list that represents the topic (top M terms by probability or relevance score). Rank the label candidates by this score.

  3. Represent the topic with the top M label candidates as ranked in 2).

In the Dynamic Topic Model case, I generalize the top term list in step 2) to just be an aggregation of the top term lists across all time slices.
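Step 2 above can be sketched as a mean-cosine-similarity ranking. The 3-dimensional vectors here are toy stand-ins for real Wikipedia2vec word and entity embeddings, and the candidate labels are illustrative.

```python
import numpy as np

def rank_labels(candidates: dict[str, np.ndarray],
                term_vectors: list[np.ndarray]) -> list[str]:
    """Rank candidate entity labels by mean cosine similarity to the
    embeddings of the topic's top terms (step 2 of the labelling method)."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {
        label: float(np.mean([cos(vec, t) for t in term_vectors]))
        for label, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings for the topic's top terms.
terms = [np.array([1.0, 0.1, 0.0]), np.array([0.9, 0.2, 0.1])]
# Toy embeddings for two candidate Wikipedia entities.
candidates = {
    "Nuclear disarmament": np.array([1.0, 0.0, 0.1]),
    "Economic growth": np.array([0.0, 1.0, 0.9]),
}
ranking = rank_labels(candidates, terms)
```

The top of `ranking` gives the labels used to represent the topic in step 3.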

Here are some example outputs from this method on the DTM: the top terms for each time slice, the top labels assigned to those slices, and a single label for the topic across all time slices.

Nuclear Weapons Topic Labels

Israel Palestine Topic Labels

Human Rights Topic Labels

Semantic Hashing

In [6], the authors present an interesting method for hashing documents using a deep generative model. I implemented the unsupervised version of the model that uses a VAE to encode a TFIDF vector and decode into a softmax distribution over the vocabulary. This could be used as a preprocessing step to bucket documents before applying more expensive pairwise comparison methods on documents within buckets.
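The bucketing step can be illustrated without the VAE itself: once the encoder produces a real-valued latent code per document, thresholding each dimension (here at its median, an assumed binarization choice) yields a binary hash, and documents sharing a hash fall into the same bucket. The latent codes below are made-up stand-ins for encoder output.

```python
from collections import defaultdict

import numpy as np

def binarize(latent: np.ndarray) -> list[str]:
    """Turn real-valued latent codes (e.g. VAE means) into binary hash
    strings by thresholding each dimension at its median."""
    thresholds = np.median(latent, axis=0)
    bits = (latent > thresholds).astype(int)
    return ["".join(map(str, row)) for row in bits]

# Toy 2-d latent codes for four documents.
latent = np.array([
    [0.9, -0.8],
    [0.8, -0.7],
    [-0.9, 0.8],
    [-0.8, 0.9],
])
hashes = binarize(latent)
buckets: dict[str, list[int]] = defaultdict(list)
for doc_id, h in enumerate(hashes):
    buckets[h].append(doc_id)
```

Pairwise comparisons then only need to run within each bucket rather than across the full corpus.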

References

[1] Dynamic Topic Models

[2] Automatic Labelling of Topics with Neural Embeddings

[3] Automatic Labeling of Multinomial Topic Models

[4] Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia

[5] LDAvis: A method for visualizing and interpreting topics

[6] Variational Deep Semantic Hashing for Text Documents

[7] Summarizing topical content with word frequency and exclusivity
