Topic Modelling

Modified on Wed, 26 Feb, 2020 at 2:16 PM

In Brief

Function: Topic Modelling analyses text documents for word co-occurrences. Words which are frequently seen together are grouped into topics. These topics can be used to categorise the documents and to understand their topical structure.

Typical Use Case: A client obtains a large corpus of text, e.g. emails, tweets or reports, and wants to sort the documents in order to identify the sets that are important for the task at hand.

Case study

Here, we have a dataset containing posts from a newsgroup. Newsgroups are essentially forums that are organised in subgroups. The discussions in each subgroup are about a wide range of topics. A company running a newsgroup is interested in providing targeted advertising to its users and needs to understand what its users are talking about.

Workflow

We construct a workflow by connecting the newsgroup demo data to the "Documents" input, and stopwords data to the "Stopwords" input. Stopwords are very common words such as "have", "would", etc. which carry no useful meaning for topic modelling. They will be removed from the input documents.
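The effect of stopword removal can be illustrated with a minimal Python sketch. It is only an illustration of the idea, not the block's actual implementation; the function name remove_stopwords and the whitespace tokenisation are assumptions made for this example.

```python
def remove_stopwords(text, stopwords):
    """Return the lower-cased words of `text` that are not stopwords."""
    stop = {word.lower() for word in stopwords}
    # Naive whitespace tokenisation, assumed here for simplicity.
    return [word for word in text.lower().split() if word not in stop]


tokens = remove_stopwords(
    "I would have bought the car",
    ["i", "would", "have", "the"],
)
print(tokens)
```

Only the words that can help to distinguish topics ("bought", "car") remain.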

Example Topic Modelling workflow

Input Data

The input data consists simply of a single field which contains the text of the posts.

Example newsgroups demo input data

Options

The next step is configuring the Topic Modelling block.

Input

Input options

Fields containing text: Select here which of your fields contain the texts from which you would like to extract the topics.

Preprocessing

Preprocessing options

Fields containing stopwords: Select here which of your fields from the stopwords input contain the stopwords (words which contain no specific meaning and should be removed from your text).

Remove entities: Select here the entities, such as usernames, hashtags, URLs or repetitive elements, which should be removed from your documents.
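As a rough illustration of entity removal, the sketch below strips URLs, usernames and hashtags with regular expressions. The patterns and the names ENTITY_PATTERNS and remove_entities are assumptions for this example; the block's own rules may differ.

```python
import re

# Hypothetical patterns; URLs come first so that "#" or "@" inside a URL
# is removed together with the URL rather than treated as a hashtag or username.
ENTITY_PATTERNS = {
    "urls": re.compile(r"https?://\S+"),
    "usernames": re.compile(r"@\w+"),
    "hashtags": re.compile(r"#\w+"),
}


def remove_entities(text, kinds):
    """Remove the selected entity kinds from `text` and tidy up whitespace."""
    for kind, pattern in ENTITY_PATTERNS.items():
        if kind in kinds:
            text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = remove_entities(
    "Thanks @anna see https://example.com #hockey tonight",
    ["urls", "usernames", "hashtags"],
)
print(cleaned)
```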

Minimum word length: In many languages, very short words generally do not carry important meaning in the context of topics. Removing these terms keeps your data clean and generates a better and faster output. Set this to a lower value for languages such as Chinese, in which a single character might form an important word.
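A minimum word length filter can be sketched as follows. The function name filter_short_words is hypothetical, and the sketch assumes the documents have already been split into words.

```python
def filter_short_words(tokens, min_length):
    """Keep only the tokens with at least `min_length` characters."""
    return [token for token in tokens if len(token) >= min_length]


# With a minimum length of 3, the short word "an" is dropped.
english = filter_short_words(["an", "ice", "hockey", "match"], 3)

# With a minimum length of 1, single-character words are kept.
chinese = filter_short_words(["冰", "球"], 1)

print(english, chinese)
```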

Modelling

Modelling options

Term limit: Omniscope will only use the selected number of terms from your documents to form topics. Increasing this number might lead to richer topics, but comes at a higher computational cost.
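One plausible way to apply a term limit is to keep only the most frequent terms of the corpus. The sketch below assumes exactly that; how the block actually ranks terms is not stated here, and the name limit_terms is hypothetical.

```python
from collections import Counter


def limit_terms(tokenised_docs, term_limit):
    """Return the `term_limit` most frequent terms across all documents."""
    counts = Counter(term for doc in tokenised_docs for term in doc)
    return {term for term, _ in counts.most_common(term_limit)}


docs = [
    ["game", "puck", "game"],
    ["game", "offside"],
    ["puck", "victory"],
]
vocabulary = limit_terms(docs, 2)
print(vocabulary)
```

Here "game" (3 occurrences) and "puck" (2 occurrences) are kept, while the rarer terms are ignored.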

N-Grams: Generally, every single word will be used to create topics. Sometimes it makes sense to consider word combinations. E.g. the words "white" and "house" are often used together to form "white house". This would constitute a 2-gram.
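Generating n-grams from a list of words can be sketched in a few lines. The function name ngrams is chosen for this example only.

```python
def ngrams(tokens, n):
    """Return all runs of `n` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["the", "white", "house", "said"], 2)
print(bigrams)
```

The 2-gram "white house" now appears as a single term alongside "the white" and "house said".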

Topics

Topics options

Number of topics per document: Sometimes documents are about multiple topics. A post about FIFA elections is both about sports and politics. Choose here if you would like to get a list of all topics found in the document (Many) or if you would just like to get the most dominant one (One).
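The difference between "One" and "Many" can be illustrated with per-document topic scores. The scores, the threshold of 0.2 and the function name assign_topics are all assumptions for this sketch; the block's own scoring is not described here.

```python
def assign_topics(topic_scores, mode, threshold=0.2):
    """Pick the dominant topic ("One") or all sufficiently strong topics ("Many")."""
    if not topic_scores:
        return []
    if mode == "One":
        return [max(topic_scores, key=topic_scores.get)]
    # "Many": every topic at or above the (hypothetical) threshold, strongest first.
    ranked = sorted(topic_scores.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, score in ranked if score >= threshold]


fifa_post = {"sports": 0.55, "politics": 0.40, "cooking": 0.05}
one = assign_topics(fifa_post, "One")
many = assign_topics(fifa_post, "Many")
print(one, many)
```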

Number of times a word is used in topics: This option enables overlapping topics if set to "Multiple times". Overlapping topics typically share a set of common words and differ in more specialised words. E.g. the topics "football" and "hockey" will most likely both contain general sport terms such as "game" and "victory", and each will contain more specialised terms such as "offside" or "puck". If "Multiple times" is selected, both topics will contain "game" and "victory". If "Once" is chosen, only one of the topics will contain each of these terms.
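To make the "Once" setting concrete, the sketch below starts from overlapping topics and keeps each term only in the topic where it has the highest weight. Both the weights and this tie-breaking rule are assumptions for illustration; the source does not say how the block decides which topic keeps a shared term.

```python
def assign_terms_once(topic_term_weights):
    """Keep each term only in the topic where its weight is highest."""
    best = {}  # term -> (topic, weight)
    for topic, weights in topic_term_weights.items():
        for term, weight in weights.items():
            if term not in best or weight > best[term][1]:
                best[term] = (topic, weight)
    result = {topic: set() for topic in topic_term_weights}
    for term, (topic, _) in best.items():
        result[topic].add(term)
    return result


# "Multiple times": both topics contain "game" and "victory".
overlapping = {
    "football": {"game": 0.9, "victory": 0.4, "offside": 0.8},
    "hockey": {"game": 0.7, "victory": 0.6, "puck": 0.9},
}
once = assign_terms_once(overlapping)
print(once)
```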

Preseeded topics: If you know that your corpus contains specific topics which you would like to be able to track, you can enter terms from these topics in order to preseed them. This guarantees that these topics with these words will be detected and part of the output.

Hierarchy

Hierarchy options

Include topic hierarchies: Topics can be arranged in a tree in which topics near the root are more general and topics towards the leaves are more specific. If this option is enabled, the topic model contains the full hierarchical tree. If it is disabled, the output contains only the leaf nodes.
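The two output shapes can be sketched with a small topic tree. The nested-dictionary representation and the function names all_nodes and leaf_nodes are assumptions for this example, not the block's actual output format.

```python
# A hypothetical topic tree: a general root with two specific leaves.
tree = {
    "name": "sports",
    "children": [
        {"name": "football", "children": []},
        {"name": "hockey", "children": []},
    ],
}


def all_nodes(node, tier=0):
    """Full hierarchy: every topic together with its tier (root = 0)."""
    rows = [(node["name"], tier)]
    for child in node["children"]:
        rows.extend(all_nodes(child, tier + 1))
    return rows


def leaf_nodes(node):
    """Hierarchy disabled: only the most specific topics (the leaves)."""
    if not node["children"]:
        return [node["name"]]
    leaves = []
    for child in node["children"]:
        leaves.extend(leaf_nodes(child))
    return leaves


print(all_nodes(tree))
print(leaf_nodes(tree))
```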

Topic specificity: Determines whether the topics should be very specific (1) or extremely broad (0). Examples of very specific topics are "football" and "hockey", while a broader topic would be "sports".

Output

Output options

Include uncategorised documents: It is sometimes not possible to categorise every single document. If this option is enabled, the categorisation output will also contain uncategorised documents, but their fields specifying topic information will be empty.
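The effect of this option on the categorisation output can be sketched as follows. The record layout and the function name categorisation_output are hypothetical and only serve to illustrate the empty topic fields.

```python
def categorisation_output(docs, assignments, include_uncategorised):
    """Build output rows; uncategorised documents get an empty (None) topic."""
    rows = []
    for doc_id, text in docs.items():
        topic = assignments.get(doc_id)  # None if no topic was found
        if topic is None and not include_uncategorised:
            continue
        rows.append({"document": text, "topic": topic})
    return rows


docs = {1: "The puck hit the post", 2: "asdf qwerty"}
assignments = {1: "hockey"}  # document 2 could not be categorised

with_uncategorised = categorisation_output(docs, assignments, True)
without_uncategorised = categorisation_output(docs, assignments, False)
print(with_uncategorised)
print(without_uncategorised)
```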

Result

Topic Model: This output specifies the full hierarchical topic tree including terms, quality of the topic and hierarchical tiers. It can be used within the network view to visualise the hierarchical structure.

Topic Terms: This is a pivoted view of the topic model in which each record refers to a term. In addition to the topic-specific information found in the topic model output, this output provides term-specific data.

Term Cooccurences: Instead of single terms, here we look at pairs of terms which have been frequently seen together. It provides statistics about the connection between terms.
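A simple co-occurrence statistic is the number of documents in which two terms appear together. The sketch below computes only that count; the statistics actually reported in this output may be richer, and the function name cooccurrence_counts is an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokenised_docs):
    """Count, for each pair of terms, the documents that contain both."""
    counts = Counter()
    for doc in tokenised_docs:
        # Sort the unique terms so each pair is always stored in the same order.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts


docs = [
    ["game", "puck", "victory"],
    ["game", "puck"],
    ["game", "offside"],
]
pairs = cooccurrence_counts(docs)
print(pairs.most_common(1))
```

The pair ("game", "puck") is seen together in two documents, so it is the strongest connection in this small example.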

Categorised Documents: Using the models, this output contains all the input documents (either from "Documents" or from "New") and their topic assignments. If the option "Number of topics per document" was set to "Many", there might be multiple output records per input record, one for each topic assignment.
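The one-record-per-assignment layout can be sketched as below. The tuple layout and the function name expand_assignments are hypothetical; they only illustrate how a document with several topics produces several records.

```python
def expand_assignments(docs, topics_per_doc):
    """Produce one output record per (document, topic) assignment."""
    rows = []
    for doc_id, text in enumerate(docs):
        for topic in topics_per_doc[doc_id]:
            rows.append((doc_id, text, topic))
    return rows


docs = ["FIFA elects a new president", "The puck hit the post"]
topics_per_doc = [["sports", "politics"], ["sports"]]

records = expand_assignments(docs, topics_per_doc)
for record in records:
    print(record)
```

Two input documents yield three output records, because the first document is assigned to two topics.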