1. Introduction

This tutorial introduces you to the Wordfish method. Unlike the Wordscores method, Wordfish is based on an unsupervised learning approach.

As with Wordscores, it is mainly the quanteda team that has worked to provide a very good R implementation. For this reason, I reproduce the official quanteda tutorial here. You can find the original source here.

2. Official Wordfish Tutorial

Wordfish is a Poisson scaling model of one-dimensional document positions (Slapin and Proksch 2008). Like Wordscores, it scales documents, but it does not require reference scores or reference texts. Because it is an unsupervised method, it estimates the positions of documents solely from the observed word frequencies.
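To make the "Poisson scaling model" concrete, the model can be written as follows. This is a sketch in the notation used by the quanteda output further down (theta, beta, psi); the index names i and j and the symbol y for the counts are my own labels.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \,\theta_i
```

Here y_{ij} is the count of word j in document i, \alpha_i is a document fixed effect (roughly, document length), \psi_j is a word fixed effect (roughly, overall word frequency), \beta_j is the word's weight on the latent dimension, and \theta_i is the document position we are interested in.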

library(quanteda)
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textmodels)
library(quanteda.textstats)
library(quanteda.corpora) # Installation instructions: https://github.com/quanteda/quanteda.corpora
library(quanteda.textplots) # For more information see: https://quanteda.io/articles/pkgdown/examples/plotting.html

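In this example, we show how to apply Wordfish to the Irish budget speeches from 2010. First, we create a document-feature matrix. Afterwards, we run Wordfish. The dir argument fixes the direction of the scale: dir = c(6, 5) constrains document 6 (Kenny, FG) to receive a lower position than document 5 (Cowen, FF).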

dfmat_irish <- data_corpus_irishbudget2010 %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("english")) %>%
  tokens_tolower() %>%
  dfm()

tmod_wf <- textmodel_wordfish(dfmat_irish, dir = c(6, 5))
summary(tmod_wf)
## 
## Call:
## textmodel_wordfish.dfm(x = dfmat_irish, dir = c(6, 5))
## 
## Estimated Document Positions:
##                              theta      se
## Lenihan, Brian (FF)        1.71694 0.02303
## Bruton, Richard (FG)      -0.43672 0.03226
## Burton, Joan (LAB)        -0.99597 0.01819
## Morgan, Arthur (SF)        0.07786 0.02935
## Cowen, Brian (FF)          1.92504 0.02509
## Kenny, Enda (FG)          -0.81200 0.02604
## ODonnell, Kieran (FG)     -0.30679 0.04668
## Gilmore, Eamon (LAB)      -0.37496 0.03247
## Higgins, Michael (LAB)    -1.20318 0.03647
## Quinn, Ruairi (LAB)       -1.23489 0.03503
## Gormley, John (Green)      0.96884 0.07982
## Ryan, Eamon (Green)        0.15616 0.06407
## Cuffe, Ciaran (Green)      0.57038 0.07298
## OCaolain, Caoimhghin (SF) -0.05072 0.03722
## 
## Estimated Feature Scores:
##      presented supplementary  budget house  last   april    said   work    way
## beta    0.3174         1.100 0.06874 0.060 0.279 -0.2068 -0.9513 0.5502 0.3246
## psi    -1.8172        -1.159 2.68241 1.014 0.944 -0.5962 -0.4688 1.0693 1.3726
##       period severe economic distress  today    can  report notwithstanding
## beta  0.5856  1.337   0.5018    1.698 0.1437 0.3711  0.7066           1.698
## psi  -0.2549 -2.140   1.5044   -4.264 0.8054 1.5106 -0.3314          -4.264
##      difficulties   past  eight months    now    road recovery  enormous
## beta        1.248 0.5386  1.698 0.7393 0.3534  0.1302   0.4230 -0.005133
## psi        -1.439 0.8689 -4.264 0.2048 1.5396 -0.1096   0.8337 -1.086163
##       benefit    main political parties    share
## beta -0.02207  0.9451   -0.3507  0.5503 -0.07773
## psi   1.34103 -0.9040   -0.4458 -0.1179 -0.17386
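To help read these numbers, here is a small base R sketch that reuses a few values copied from the output above. It first checks the direction constraint set by dir = c(6, 5), then shows how beta and theta combine. The short names and the choice of the word "economic" are mine, and holding the document effect alpha equal across two speeches is an assumption made only for illustration.

```r
# Values copied from the summary() output above (documents 5, 6 and 10).
theta <- c(Cowen = 1.92504, Kenny = -0.81200, Quinn = -1.23489)
beta_economic <- 0.5018

# dir = c(6, 5) asks for theta[6] < theta[5], i.e. Kenny to the left of Cowen.
dir_ok <- theta[["Kenny"]] < theta[["Cowen"]]
print(dir_ok)

# Under the Wordfish model, log(lambda_ij) = alpha_i + psi_j + beta_j * theta_i.
# If alpha is held equal for both documents (illustrative assumption), psi_j
# cancels and the ratio of expected rates depends only on beta_j and theta.
rate_ratio <- exp(beta_economic * (theta[["Cowen"]] - theta[["Quinn"]]))
print(rate_ratio)
```

Under that assumption, a word with a positive beta such as "economic" is expected noticeably more often in Cowen's speech than in Quinn's, while a word with a beta near zero would be expected about equally often in both.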

We can plot the results of a fitted scaling model using textplot_scale1d().

textplot_scale1d(tmod_wf)

The function also allows plotting the scores by a grouping variable, in this case the party affiliation of the speakers.

textplot_scale1d(tmod_wf, groups = dfmat_irish$party)
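If you want the numbers behind these plots, the following base R sketch rebuilds them from the estimates printed above. The data frame, its column names, and the shortened speaker names are my own. The intervals are computed as theta ± 1.96 × se, which I assume is comparable to the error bars in the plot, and the party means are an additional summary that the plot itself does not show.

```r
# Estimates copied from the summary() output above; party taken from the labels.
docs <- data.frame(
  speaker = c("Lenihan", "Bruton", "Burton", "Morgan", "Cowen", "Kenny",
              "ODonnell", "Gilmore", "Higgins", "Quinn", "Gormley", "Ryan",
              "Cuffe", "OCaolain"),
  party = c("FF", "FG", "LAB", "SF", "FF", "FG", "FG", "LAB", "LAB", "LAB",
            "Green", "Green", "Green", "SF"),
  theta = c(1.71694, -0.43672, -0.99597, 0.07786, 1.92504, -0.81200,
            -0.30679, -0.37496, -1.20318, -1.23489, 0.96884, 0.15616,
            0.57038, -0.05072),
  se = c(0.02303, 0.03226, 0.01819, 0.02935, 0.02509, 0.02604, 0.04668,
         0.03247, 0.03647, 0.03503, 0.07982, 0.06407, 0.07298, 0.03722)
)

# Approximate 95% intervals around each document position.
docs$lower <- docs$theta - 1.96 * docs$se
docs$upper <- docs$theta + 1.96 * docs$se

# Documents ordered by party and then by position, similar to the grouped plot.
print(docs[order(docs$party, docs$theta), ])

# Mean position per party, sorted from left to right on the estimated scale.
party_mean <- tapply(docs$theta, docs$party, mean)
print(sort(party_mean))
```

In this output the two government parties (FF and Green) end up on the positive side of the scale and Labour on the negative side, which matches the grouping visible in the plot.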

Finally, we can plot the estimated word positions and highlight certain features.

textplot_scale1d(tmod_wf, margin = "features",
                 highlighted = c("government", "global", "children",
                                 "bank", "economy", "citizenship",
                                 "productivity", "deficit"))