LDA

1. Introduction

LDA stands for Latent Dirichlet Allocation and is one of the most popular approaches to topic modelling. The goal of topic modelling is to assign topics to documents automatically, without human supervision. It is therefore a form of unsupervised classification. Although the mathematics behind the LDA algorithm is quite challenging, fitting an LDA topic model in R is very easy.

To understand what topic models are and what they can be useful for, we will try out a few things below. We use the inaugural addresses of the U.S. presidents and first create a document-feature matrix (DFM) at the paragraph level:

library(quanteda)

## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1

## Parallel computing: 4 of 4 threads used.

## See https://quanteda.io for tutorials and examples.

library(quanteda.textplots)
library(quanteda.textstats)

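# Split each speech into paragraphs so that every paragraph is one document
inaug_corpus <- corpus_reshape(data_corpus_inaugural, to = "paragraphs")

# Tokenise, drop numbers, punctuation, symbols and English stop words,
# lowercase everything, and build the document-feature matrix
inaug_dfm <- inaug_corpus %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("english")) %>%
  tokens_tolower() %>%
  dfm()

# Called without thresholds, dfm_trim() should leave the features unchanged;
# arguments such as min_termfreq or min_docfreq would prune rare terms
inaug_dfm <- dfm_trim(inaug_dfm)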
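To make clear what a DFM contains, here is a minimal sketch in base R that builds one by hand. The two documents, the tiny stop word list and the helper `tokenise()` are all made up for illustration; quanteda does the same job far more robustly.

```r
# Two hypothetical toy documents
docs <- c(d1 = "The people and the government.",
          d2 = "Peace for the people, peace for the world.")

# Tiny illustrative stop word list (not quanteda's list)
stop_words <- c("the", "and", "for")

# Lowercase, strip punctuation, split on whitespace, drop stop words
tokenise <- function(x) {
  toks <- strsplit(tolower(gsub("[[:punct:]]", "", x)), "\\s+")[[1]]
  toks[!toks %in% stop_words & nzchar(toks)]
}

toks  <- lapply(docs, tokenise)
vocab <- sort(unique(unlist(toks)))

# Rows are documents, columns are features, cells are counts
toy_dfm <- t(sapply(toks, function(tk) table(factor(tk, levels = vocab))))
toy_dfm
```

Each row is one document and each column one word; a topic model only ever sees these counts, not the word order.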

1.1 Fitting an LDA model

To fit an LDA model, we first convert our DFM to the topicmodels format (this requires the topicmodels package) and then run the estimation. Please note the use of set.seed(): it makes the analysis reproducible.

library(topicmodels)
inaug_topicmod <-  convert(inaug_dfm, to = "topicmodels") 

set.seed(1)
inaug_lda <-  LDA(inaug_topicmod, method = "Gibbs", k = 10,  control = list(alpha = 0.1))
inaug_lda

## A LDA_Gibbs topic model with 10 topics.
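Gibbs sampling relies on random draws, which is why the seed matters. The following base R sketch (with made-up variable names) shows the general principle: resetting the seed reproduces exactly the same pseudo-random numbers.

```r
# Same seed, same pseudo-random draws
set.seed(1)
draw_a <- sample(1:100, 5)

set.seed(1)
draw_b <- sample(1:100, 5)

identical(draw_a, draw_b)  # TRUE

# Without resetting the seed, the generator simply continues
draw_c <- sample(1:100, 5)
identical(draw_a, draw_c)
```

As far as I know, topicmodels additionally accepts a `seed` entry in the `control` list; check the package help before relying on it.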

Although an LDA model automatically sorts the paragraphs into topic clusters, we have to decide ourselves how many topics we want to find (here k = 10).

This is the biggest problem with LDA models: there is no established method for determining the correct value of k. Ideally, you choose k on theoretical grounds, but that is usually very difficult and rarely done in practice. Alternatively, there are ways to determine the number of topics mathematically, but these are contested as well. In the end, choosing k is always a trial-and-error process: we vary k until we are satisfied with the results. If we were in the middle of a research project, we would fit many more models with different values of k and compare their results. The LDA method therefore always involves a certain degree of arbitrariness.

There are also hyperparameters (such as alpha) that we can tune to gain some control over how the paragraphs are distributed across topics. For more details, consult the help pages of the packages and the literature.
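To get a feeling for what alpha does, the sketch below simulates document-topic proportions from a symmetric Dirichlet prior in base R. The helper `rdirichlet_sym()` is a made-up name and the comparison value alpha = 10 is purely illustrative; only alpha = 0.1 comes from the LDA call above.

```r
# Draw n vectors from a symmetric Dirichlet(alpha, ..., alpha) over k topics
# by normalising independent Gamma(alpha, 1) draws
rdirichlet_sym <- function(n, k, alpha) {
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), nrow = n)
  g / rowSums(g)
}

set.seed(1)
k <- 10
theta_sparse <- rdirichlet_sym(1000, k, alpha = 0.1)  # value used in the LDA call
theta_dense  <- rdirichlet_sym(1000, k, alpha = 10)   # hypothetical comparison

# Average share of the single largest topic per simulated document
mean(apply(theta_sparse, 1, max))
mean(apply(theta_dense, 1, max))
```

With a small alpha, most of a simulated document's probability mass sits on one or two topics; with a large alpha, the mass is spread almost evenly. A small alpha therefore encodes the expectation that each paragraph is about only a few topics.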

1.2 Inspecting the LDA results

To see how well or badly our LDA model has performed, the easiest approach is to examine the top terms of each topic:

terms(inaug_lda, 5)

##      Topic 1   Topic 2        Topic 3 Topic 4      Topic 5  Topic 6  
## [1,] "shall"   "government"   "may"   "public"     "world"  "years"  
## [2,] "upon"    "states"       "can"   "government" "peace"  "now"    
## [3,] "may"     "people"       "upon"  "laws"       "must"   "world"  
## [4,] "people"  "constitution" "law"   "can"        "people" "history"
## [5,] "country" "union"        "good"  "revenue"    "can"    "us"     
##      Topic 7     Topic 8   Topic 9  Topic 10 
## [1,] "president" "nations" "every"  "us"     
## [2,] "mr"        "war"     "people" "let"    
## [3,] "justice"   "united"  "can"    "must"   
## [4,] "fellow"    "peace"   "life"   "america"
## [5,] "citizens"  "foreign" "home"   "can"

Here you can see which words were assigned to which topics. It is now the researcher's job to interpret these results and to assign suitable headings/labels (this interpretation must be theory-driven). As you can see, not only the choice of k is subjective, but also the interpretation of the results.

The posterior() function returns the posterior distributions of words over topics and of topics over documents. We can use them to draw a word cloud in which terms are sized in proportion to their probability within a topic:

topic <-  6
words_topic6 <-  posterior(inaug_lda)$terms[topic, ]
topwords_topic6 <- head(sort(words_topic6, decreasing = TRUE), n = 50)
head(topwords_topic6)

##       years         now       world     history          us         new 
## 0.018594129 0.012892665 0.012340911 0.011237402 0.009766056 0.009582138
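The structure being indexed here is a topics-by-terms matrix in which every row is a probability distribution over the vocabulary. A toy version with made-up numbers (two topics, four terms) shows the same sort-and-head pattern:

```r
# Hypothetical 2 topics x 4 terms matrix; each row sums to 1
toy_terms <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.25, 0.60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("years", "world", "peace", "war"))
)
rowSums(toy_terms)

# Top two terms of topic 2, highest probability first
top_terms <- head(sort(toy_terms[2, ], decreasing = TRUE), n = 2)
names(top_terms)
```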

We can of course plot these words:

library(wordcloud)

## Loading required package: RColorBrewer

wordcloud(names(topwords_topic6), topwords_topic6)

We can also look at the topics per document in order to find the top documents per topic:

topic_per_docs <-  posterior(inaug_lda)$topics[, topic] 
topic_per_docs <- sort(topic_per_docs, decreasing = TRUE)
head(topic_per_docs)

##     2009-Obama.13 2021-Biden.txt.18 2021-Biden.txt.31      2017-Trump.3 
##         0.9250000         0.9250000         0.9181818         0.9100000 
##       1989-Bush.3     2017-Trump.12 
##         0.9025000         0.9000000

We can now combine this with our corpus:

inaug_topdoc <-  names(topic_per_docs)[1]
inaug_topdoc_corp <-  inaug_corpus[docnames(inaug_corpus) == inaug_topdoc]
as.character(inaug_topdoc_corp)

##                                                                                                        2009-Obama 
## "For us, they toiled in sweatshops and settled the West; endured the lash of the whip and plowed the hard earth."

Finally, we can examine which president favours which topics:

inaug_docs <-  docvars(inaug_dfm)[match(rownames(inaug_dfm), docnames(inaug_dfm)),]
topics_per_pr <-  aggregate(posterior(inaug_lda)$topics, by=inaug_docs["President"], mean)
rownames(topics_per_pr) <-  topics_per_pr$President
heatmap(as.matrix(topics_per_pr[-1]))
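The aggregation step averages the document-level topic proportions within each president. A small sketch with made-up numbers (three paragraphs, two topics, two hypothetical presidents "A" and "B") shows what aggregate() does here:

```r
# Hypothetical document-topic proportions for three paragraphs
toy_topics <- matrix(
  c(0.9, 0.1,
    0.7, 0.3,
    0.2, 0.8),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a.1", "a.2", "b.1"), c("topic1", "topic2"))
)
# Hypothetical metadata: which president each paragraph belongs to
toy_meta <- data.frame(President = c("A", "A", "B"))

# Mean topic proportion per president
toy_means <- aggregate(toy_topics, by = toy_meta["President"], FUN = mean)
rownames(toy_means) <- toy_means$President
as.matrix(toy_means[-1])
```

President "A" ends up with the average of the first two rows, president "B" with the third row. The heatmap then plots this presidents-by-topics matrix.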

As you can see, the topics form a kind of "block" distribution, with more modern and older presidents using quite different topics. So either the role of the president has changed, or the use of language, or (most likely) both. I can only repeat myself, though: because the choice of the number of topics and the interpretation of the topics depend entirely on the researcher, LDA models should always be treated with some caution!

To better capture temporal dynamics, we now turn to structural topic models (STMs). STMs allow us to condition topic models on additional covariates.

STM

2. Introduction

STMs are an extension of LDA. They allow metadata such as date or author to be modelled as covariates of topic prevalence and/or topic content. The stm package provides an excellent implementation for R (see http://structuraltopicmodel.com).

For our examples we stay with the inaugural addresses of the U.S. presidents. Because we built our corpus and our DFM with quanteda, the metadata (docvars) are carried along with both objects:

inaug_corpus

## Corpus consisting of 1,759 documents and 4 docvars.
## 1789-Washington.1 :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1789-Washington.2 :
## "Among the vicissitudes incident to life no event could have ..."
## 
## 1789-Washington.3 :
## "Such being the impressions under which I have, in obedience ..."
## 
## 1789-Washington.4 :
## "By the article establishing the executive department it is m..."
## 
## 1789-Washington.5 :
## "Besides the ordinary objects submitted to your care, it will..."
## 
## 1789-Washington.6 :
## "To the foregoing observations I have one to add, which will ..."
## 
## [ reached max_ndoc ... 1,753 more documents ]

inaug_dfm

## Document-feature matrix of: 1,759 documents, 9,212 features (99.64% sparse) and 4 docvars.
##                    features
## docs                fellow-citizens senate house representatives among
##   1789-Washington.1               1      1     1               1     0
##   1789-Washington.2               0      0     0               0     1
##   1789-Washington.3               0      0     0               0     0
##   1789-Washington.4               0      0     0               0     0
##   1789-Washington.5               0      0     0               0     0
##   1789-Washington.6               0      0     1               1     0
##                    features
## docs                vicissitudes incident life event filled
##   1789-Washington.1            0        0    0     0      0
##   1789-Washington.2            1        1    1     1      1
##   1789-Washington.3            0        0    0     1      0
##   1789-Washington.4            0        0    0     0      0
##   1789-Washington.5            0        0    0     0      0
##   1789-Washington.6            0        0    0     0      0
## [ reached max_ndoc ... 1,753 more documents, reached max_nfeat ... 9,202 more features ]

2.1 Fitting an STM model

We now fit our model with the stm package, for the time being without metadata/covariates:

library(stm)

## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

inaug_stm <-  stm(inaug_dfm, K = 10, max.em.its = 10)

## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      ............................................................................................
## Initialization complete.
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -8.123) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.989, relative change = 1.643e-02) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -7.933, relative change = 7.036e-03) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -7.907, relative change = 3.221e-03) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -7.894, relative change = 1.655e-03) 
## Topic 1: every, time, nations, without, know 
##  Topic 2: nation, free, rights, state, men 
##  Topic 3: never, justice, law, duty, hope 
##  Topic 4: us, new, constitution, can, progress 
##  Topic 5: must, country, made, one, part 
##  Topic 6: people, government, shall, peace, union 
##  Topic 7: upon, world, citizens, american, spirit 
##  Topic 8: great, now, america, national, just 
##  Topic 9: can, united, let, life, one 
##  Topic 10: may, states, power, best, history 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -7.887, relative change = 9.506e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -7.882, relative change = 6.078e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -7.879, relative change = 4.301e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 9 (approx. per word bound = -7.876, relative change = 3.321e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Terminated Before Convergence Reached

K sets the number of topics. As with LDA, this choice is entirely in your hands and must be theory-driven. max.em.its sets the maximum number of iterations ("Iteration generally describes a process of repeating the same or similar actions multiple times in order to approach a solution or a specific goal." Source: Wikipedia). Ten iterations is probably too few, as the message "Model Terminated Before Convergence Reached" in the output shows, but since this example is for demonstration purposes only, it keeps the computing time reasonably short. As always, I recommend that you read up on the parameters yourselves.
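The interplay between a maximum number of iterations and a convergence criterion can be sketched with a generic iterative procedure. The code below is not stm's estimation routine; `iterate_sqrt()` and its tolerance are made up for illustration and simply approximate a square root with Newton updates.

```r
# Generic iterative loop (not stm's actual code): stop when the relative
# change falls below `tol` or when `max_its` iterations are used up
iterate_sqrt <- function(target, max_its, tol = 1e-5) {
  x <- target  # crude starting value
  for (i in seq_len(max_its)) {
    x_new <- 0.5 * (x + target / x)        # one Newton update
    rel_change <- abs(x_new - x) / abs(x)  # how much did the estimate move?
    x <- x_new
    if (rel_change < tol) {
      return(list(value = x, iterations = i, converged = TRUE))
    }
  }
  list(value = x, iterations = max_its, converged = FALSE)
}

iterate_sqrt(1e6, max_its = 5)$converged   # stopped by the iteration cap
iterate_sqrt(1e6, max_its = 50)$converged  # stopped by the tolerance
```

With a cap of five iterations the loop stops early and the estimate is still far from the solution; with a generous cap the relative change eventually drops below the tolerance. The stm log reports the analogous "relative change" after each iteration.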

2.2 Inspecting the STM model

STM works similarly to LDA, but it also models correlations between topics. To examine the results of the topic model, we can use further functions from the stm package:

plot(inaug_stm, type="summary", labeltype = "frex")

labelTopics(inaug_stm, topic=8)

## Topic 8 Top Words:
##       Highest Prob: great, now, america, national, just, confidence, make 
##       FREX: expect, clear, fact, judgment, confidence, conflict, supreme 
##       Lift: summons, urban, bigoted, coherence, fundamentally, label, polls 
##       Score: great, america, now, national, just, summons, make
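The four word lists rank the same vocabulary by different criteria. As a rough illustration of why "Highest Prob" and "Lift" can disagree, the sketch below computes lift under one common definition: a word's probability within a topic divided by its overall share in the corpus. All numbers are made up, and stm's exact computation may differ in detail.

```r
# Hypothetical topic-word probabilities (2 topics x 3 words)
beta <- matrix(
  c(0.70, 0.25, 0.05,
    0.60, 0.10, 0.30),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), c("great", "america", "summons"))
)
# Hypothetical corpus-wide word counts
word_counts <- c(great = 650, america = 175, summons = 175)

# Lift: within-topic probability divided by overall corpus share
corpus_share <- word_counts / sum(word_counts)
lift <- sweep(beta, 2, corpus_share, "/")

names(which.max(beta["topic2", ]))  # most probable word in topic 2
names(which.max(lift["topic2", ]))  # word with the highest lift in topic 2
```

A frequent word such as "great" has a high probability in almost every topic, whereas lift favours words that are unusually concentrated in one topic. This is why the Lift list tends to contain rare, distinctive words.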

We can also display the words per topic and the words 'between' two topics:

cloud(inaug_stm, topic=8)

plot(inaug_stm, type="perspectives", topics=c(8,9)) # in this case between topics 8 and 9

2.3 Covariates

Now we model the year as a covariate for our model (prevalence =~ Year):

inaug_stm_year <- stm(inaug_dfm, K = 10, prevalence =~ Year, max.em.its = 10)

## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      ............................................................................................
## Initialization complete.
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -8.123) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.989, relative change = 1.644e-02) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -7.933, relative change = 7.076e-03) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -7.907, relative change = 3.241e-03) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -7.894, relative change = 1.676e-03) 
## Topic 1: every, time, nations, without, know 
##  Topic 2: nation, free, rights, state, men 
##  Topic 3: never, justice, law, duty, hope 
##  Topic 4: us, new, constitution, can, progress 
##  Topic 5: must, country, made, one, part 
##  Topic 6: people, government, shall, peace, union 
##  Topic 7: upon, world, citizens, american, spirit 
##  Topic 8: great, now, america, national, just 
##  Topic 9: can, united, let, life, one 
##  Topic 10: may, states, power, best, history 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -7.886, relative change = 9.723e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -7.881, relative change = 6.293e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -7.877, relative change = 4.512e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 9 (approx. per word bound = -7.875, relative change = 3.537e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Terminated Before Convergence Reached

In addition to the functions mentioned above, we can also model the effect of the year with the function estimateEffect:

inaug_year_effects <- estimateEffect(1:10 ~ Year, stmobj = inaug_stm_year, metadata = docvars(inaug_dfm))
summary(inaug_year_effects, topics=1:8)

## 
## Call:
## estimateEffect(formula = 1:10 ~ Year, stmobj = inaug_stm_year, 
##     metadata = docvars(inaug_dfm))
## 
## 
## Topic 1:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.015e-01  5.402e-02  11.135   <2e-16 ***
## Year        -2.679e-04  2.767e-05  -9.679   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 2:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.730e-03  5.517e-02   0.031    0.975
## Year        4.087e-05  2.839e-05   1.440    0.150
## 
## 
## Topic 3:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 7.689e-02  4.316e-02   1.781    0.075 .
## Year        2.954e-06  2.220e-05   0.133    0.894  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 4:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.587e-01  5.196e-02  -3.054   0.0023 ** 
## Year         1.271e-04  2.681e-05   4.742 2.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 5:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -4.110e-02  4.407e-02  -0.932  0.35124   
## Year         6.930e-05  2.275e-05   3.046  0.00235 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 6:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.361e-01  4.637e-02   5.091 3.93e-07 ***
## Year        -5.528e-05  2.379e-05  -2.324   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 7:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.144e-02  5.152e-02   0.804    0.421
## Year        3.378e-05  2.654e-05   1.273    0.203
## 
## 
## Topic 8:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.956e-02  4.503e-02   0.879    0.380
## Year        3.514e-05  2.332e-05   1.507    0.132
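The coefficients are on the scale of topic proportions per year, so they look tiny even when the long-run change is substantial. A short worked example, using the Topic 1 estimates copied from the summary output above and treating the relationship as a simple straight line (`predict_prevalence()` is a made-up helper):

```r
# Coefficients for Topic 1, copied from the summary output
intercept <- 6.015e-01
slope     <- -2.679e-04

# Expected proportion of Topic 1 in a given year under the linear fit
predict_prevalence <- function(year) intercept + slope * year

predict_prevalence(1789)  # first inaugural address in the corpus
predict_prevalence(2021)  # most recent one
predict_prevalence(2021) - predict_prevalence(1789)  # equals slope * 232
```

Under this linear fit, the expected share of Topic 1 drops from roughly 0.12 in 1789 to roughly 0.06 in 2021, i.e. it is about halved over the period.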

We can of course also visualise topic prevalence over time:

plot(inaug_year_effects, "Year", method = "continuous", topics = c(8,9), model = inaug_stm_year)

Finally, we add the respective president as a content covariate, since each president probably uses different words to discuss similar topics. We add this covariate with the content argument (content =~ President):

inaug_stm_präs <- stm(inaug_dfm, K = 10, content =~ President, max.em.its = 10)

## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      ............................................................................................
## Initialization complete.
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (149 seconds). 
## Completing Iteration 1 (approx. per word bound = -8.123) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (176 seconds). 
## Completing Iteration 2 (approx. per word bound = -7.219, relative change = 1.113e-01) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (226 seconds). 
## Completing Iteration 3 (approx. per word bound = -7.214, relative change = 6.308e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (227 seconds). 
## Completing Iteration 4 (approx. per word bound = -7.212, relative change = 3.686e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (204 seconds). 
## Model Converged

If we now ask for the top terms, we get the terms per topic, the terms per president, and the terms for each combination of topic and president:

labelTopics(inaug_stm_präs, topics = 8)

## Topic Words:
##  Topic 8: now, national, great, confidence, just, full, founded 
##  
##  Covariate Words:
##  Group Adams: exploring, hunters, potentates, researches, revised, surviving, traces 
##  Group Biden: year's, manipulated, person's, shoes, urgency, challenging, virus 
##  Group Buchanan: fairer, inhabitant, nebraska-kansas, residents, alienated, benefiting, defray 
##  Group Bush: america's, hurts, mentor's, mosque, pastor's, synagogue, youngest 
##  Group Carter: exemplify, mediocrity, emulation, micah, attest, coleman, julia 
##  Group Cleveland: enforced, frugality, unstrained, appalled, wards, commands, impatience 
##  Group Clinton: defied, persian, somalia, testament, biological, chemical, flexible 
##  Group Coolidge: represents, array, bearings, continually, essentials, firmament, accurately 
##  Group Eisenhower: fired, needy, seeds, self-righteousness, uttering, balk, deter 
##  Group Garfield: enfeebled, gentleness, laying, self-support, unquestioning, ecclesiastical, scruples 
##  Group Grant: steam, telegraph, rehabilitated, contiguous, debtor, farthing, repudiator 
##  Group Harding: autocracy, intuitions, nation-wide, lightened, mindfulness, omission, squared 
##  Group Harrison: attributable, bosoms, observable, brutus, camillus, conqueror, curtii 
##  Group Hayes: merits, afflict, arrested, truest, calamitous, supplemented, furtherance 
##  Group Hoover: eighteenth, broadening, engrossed, extinction, negation, observers, superficial 
##  Group Jackson: overrule, preparatory, counteract, profuse, appropriation, consist, impost 
##  Group Jefferson: arraignment, bulwarks, burthened, compress, corpus, habeas, handmaid 
##  Group Johnson: vigilantly, clamor, underneath, tears, awaited, rekindle, reopen 
##  Group Kennedy: l, nixon, signifying, symbolizing, truman, communists, huts 
##  Group Lincoln: construe, hypercritical, specify, unrepealed, rejects, refuses, secede 
##  Group Madison: deserters, emigrating, naturalizing, traitors, avowed, battles, aliment 
##  Group McKinley: conforming, deems, discharging, evacuation, historical, involving, assisting 
##  Group Monroe: incumbent, dense, digest, endearing, lieu, paved, retarded 
##  Group Nixon: erode, frustration, flimsy, interlude, passionately, dirksen, humphrey 
##  Group Obama: hindus, jews, muslims, non-believers, segregation, swill, tasted 
##  Group Pierce: float, folds, gallantry, self-devotion, trustful, unobtrusive, waved 
##  Group Polk: chances, exceptions, incautiously, liabilities, perceive, indulged, misguided 
##  Group Reagan: mathias, inches, miles, punitive, reawaken, roadblocks, slowed 
##  Group Roosevelt: befit, washington's, weld, opportunism, timidity, translated, exhortations 
##  Group Taft: progressing, redounds, sugar, tobacco, unabated, upholding, porto 
##  Group Taylor: doctrines, warned, manifold, sympathize, devolved, conform, abstain 
##  Group Truman: defeatism, losing, equipment, technical, freedom-loving, partners, brutal 
##  Group Trump: bedrock, reaped, obama, neighborhoods, complaining, winning, mountain 
##  Group Van Buren: countless, far-distant, hastily, impresses, overbalanced, predicted, theorists 
##  Group Washington: arrive, adorn, behold, congenial, designates, disregards, indissoluble 
##  Group Wilson: audience, delegation, aspects, fortuitous, rectify, accident, sweep 
##  
##  Topic-Covariate Interactions:
##  Topic 8, Group Adams: throng, prerogative, intend, influenced, comparatively, cease, vast 
##  Topic 8, Group Biden: desirable, entirely, performed, fraternal, realized, jurisdiction, period 
##  Topic 8, Group Buchanan: pronounce, prosecuted, reposes, tones, uncomplaining, animating, counsels 
##  Topic 8, Group Bush: confine, confess, concentration, blot, comprised, disguised, dishonor 
##  Topic 8, Group Carter: all-pervading, carolina, cheered, circumscribed, comprehension, delusive, fearfully 
##  Topic 8, Group Cleveland: skill, unsolicited, untarnished, planted, performance, apparent, lights 
##  Topic 8, Group Clinton: exhortation, fragments, green, monticello, reunite, slopes, smiled 
##  Topic 8, Group Coolidge: densely, evidences, intellects, populated, skirt, threefold, unfounded 
##  Topic 8, Group Eisenhower: genial, orleans, southwestern, arts, discretion, protecting, tribunal 
##  Topic 8, Group Garfield: veto, minorities, proposition, negative, executes, adoption, reform 
##  Topic 8, Group Grant: despots, disastrous, imitate, religiously, ruinous, shudder, surer 
##  Topic 8, Group Harding: boldest, distrusted, tremble, younger, transfers, periodically, encroaching 
##  Topic 8, Group Harrison: shrinks, mechanic, frontier, discriminating, articles, watchfulness, specified 
##  Topic 8, Group Hayes: texas, conventional, eighty, headsprings, hearers, missouri, oregon 
##  Topic 8, Group Hoover: befitting, concise, enumeration, caprice, dispense, oppressing, usurpations 
##  Topic 8, Group Jackson: artisans, jeopard, remunerating, skillful, studiously, agriculturists, levying 
##  Topic 8, Group Jefferson: heaven-favored, mischiefs, omnipotence, arid, adjusting, consume, auspicious 
##  Topic 8, Group Johnson: discountenanced, heartburnings, delegated, consummate, disguise, solicitation, judiciously 
##  Topic 8, Group Kennedy: two-party, heroes, freezing, counter, untamed, whoever, yes 
##  Topic 8, Group Lincoln: bloated, futile, prescription, reelected, sending, well-intentioned, deficits 
##  Topic 8, Group Madison: demilitarize, militarize, obsolete, rid, target, churchill, paraphrase 
##  Topic 8, Group McKinley: baker, every-4-year, hatfield, mondale, moomaw, occurrence, routinely 
##  Topic 8, Group Monroe: boston, dared, lawyer, reestablished, retired, rivals, softened 
##  Topic 8, Group Nixon: infirm, disadvantaged, handled, upgrade, city's, giants, shrines 
##  Topic 8, Group Obama: row, david, fraction, heroism, sloping, tiny, freeing 
##  Topic 8, Group Pierce: compassion, columns, oar, rode, sunset, penalizes, causing 
##  Topic 8, Group Polk: beach, belleau, chop, chosin, guadalcanal, halfway, hero 
##  Topic 8, Group Reagan: machine, rob, yielded, belonged, burger, killing, retaliate 
##  Topic 8, Group Roosevelt: cancers, uncorrupted, stricken, changers, aught, faintness, perils 
##  Topic 8, Group Taft: self-serving, undeserved, hiding, intense, outgrown, unbending, unflinching 
##  Topic 8, Group Taylor: ebbing, surging, unexplained, hangs, incomes, pall, individualists 
##  Topic 8, Group Truman: drastically, engaging, foreclosure, output, overbalance, planning, talking 
##  Topic 8, Group Trump: invigorated, rested, condoned, formerly, hard-headedness, hardheartedness, labeled 
##  Topic 8, Group Van Buren: three-score, count, multitudes, gettysburg, spot, heights, respects 
##  Topic 8, Group Washington: economics, fashioning, morally, pays, practicality, relearned, unimagined 
##  Topic 8, Group Wilson: recovery, rounded, dulled, persistence, portents, reappear, symptoms 
##

We can visualise the word choice per president as a perspectives plot:

plot(inaug_stm_präs, type="perspectives", topics=8, covarlevels = c("Obama", "Trump"))

In this way we can find texts in which a particular president talks about a particular topic:

findThoughts(inaug_stm_präs, texts = as.character(inaug_corpus), topics = 8, where = President == "Trump", meta = docvars(inaug_dfm))

## 
##  Topic 8: 
##       We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people.
##      But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential.
##      We will make America wealthy again.

Finally, we can plot the correlation structure of the model:

inaug_stm_präs_corr <-  topicCorr(inaug_stm_präs)
plot(inaug_stm_präs_corr)
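Conceptually, such a correlation graph connects topics that tend to occur together in the same documents. The sketch below illustrates the idea with made-up document-topic proportions and a plain correlation matrix; the cutoff of 0.01 is an assumption for illustration, and the actual procedure inside topicCorr may differ.

```r
# Hypothetical document-topic proportions: 5 documents, 3 topics
theta <- rbind(
  c(0.45, 0.45, 0.10),
  c(0.40, 0.40, 0.20),
  c(0.10, 0.10, 0.80),
  c(0.05, 0.15, 0.80),
  c(0.30, 0.40, 0.30)
)
colnames(theta) <- paste0("topic", 1:3)

# Correlation of topic proportions across documents
topic_cor <- cor(theta)
round(topic_cor, 2)

# Treat positive correlations above a small cutoff as edges of the graph
edges <- topic_cor > 0.01 & upper.tri(topic_cor)
which(edges, arr.ind = TRUE)
```

In this toy data, topics 1 and 2 rise and fall together and are therefore linked, while topic 3 is negatively correlated with both and stays isolated.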

Of course, we are not limited to a single covariate. The following example uses both the year and the president for prevalence and the president as the content covariate:

inaug_stm_complex <- stm(inaug_dfm, K = 10, content =~ President, prevalence =~ Year + President, max.em.its = 10)

## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      ............................................................................................
## Initialization complete.
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (148 seconds). 
## Completing Iteration 1 (approx. per word bound = -8.123) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (176 seconds). 
## Completing Iteration 2 (approx. per word bound = -7.218, relative change = 1.114e-01) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (226 seconds). 
## Completing Iteration 3 (approx. per word bound = -7.214, relative change = 5.719e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (227 seconds). 
## Completing Iteration 4 (approx. per word bound = -7.210, relative change = 5.922e-04) 
## .......................................................................................................
## Completed E-Step (0 seconds). 
## ....................................................................................................
## Completed M-Step (229 seconds). 
## Model Converged

STM contains many more useful functions, e.g. for selecting the best model or for choosing the number of topics (K). For these, have a look at the vignette (from http://www.structuraltopicmodel.com/) and the help files of the stm package!

Topic Models: LDA and STM

Philipp Meyer, Institut für Politikwissenschaft

23.06.2021, Topic Models, Seminar: Quantitative Textanalyse
