1. Introduction

This tutorial introduces the Wordscores method. Like Naive Bayes, it is a supervised approach; unlike a classifier, it places texts on a continuous scale rather than sorting them into classes. Wordscores was developed largely by Ken Benoit and is implemented in quanteda. For this reason, I reproduce the official quanteda tutorial below. The original source can be found here.

2. Official Wordscores Tutorial

Wordscores is a scaling model for estimating the positions of actors (mostly political ones) on dimensions that are specified a priori. Wordscores was introduced in Laver, Benoit and Garry (2003) and is widely used among political scientists.

library(quanteda)
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textmodels)
library(quanteda.textstats)
library(quanteda.corpora) # installation instructions: https://github.com/quanteda/quanteda.corpora
library(quanteda.textplots) # for more information, see: https://quanteda.io/articles/pkgdown/examples/plotting.html

Training a Wordscores model requires reference scores for texts whose policy positions on well-defined a priori dimensions are “known”. Afterwards, Wordscores estimates the positions for the remaining “virgin” texts.
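To make the idea concrete, here is a minimal base-R sketch of the scoring logic described in Laver, Benoit and Garry (2003). The word counts, document names and reference scores are made up for illustration, and the code is a hand-rolled toy, not quanteda's implementation. Each word receives the average of the reference scores, weighted by how strongly the word points to each reference text; a virgin text is then scored as the frequency-weighted mean of its word scores.

```r
# Toy counts (hypothetical): rows are reference texts, columns are words
ref_counts <- matrix(c(2, 8,
                       8, 2),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("ref_left", "ref_right"),
                                     c("tax", "welfare")))
ref_scores <- c(ref_left = -1, ref_right = 1)  # "known" positions (made up)

# Relative frequency of each word within each reference text
rel_freq <- ref_counts / rowSums(ref_counts)

# Probability of being in reference text r, given that we read word w
p_ref_given_word <- sweep(rel_freq, 2, colSums(rel_freq), "/")

# Word score: expected reference score given the word
word_scores <- colSums(p_ref_given_word * ref_scores)

# Score a virgin text as the frequency-weighted mean of its word scores
virgin_counts <- c(tax = 3, welfare = 1)
virgin_score <- sum(virgin_counts / sum(virgin_counts) * word_scores)

print(word_scores)
print(virgin_score)
```

In this toy setup "tax" leans towards the right-hand reference text and "welfare" towards the left-hand one, so a virgin text that uses "tax" more often lands on the positive side of the scale.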

We use manifestos of the 2013 and 2017 German federal elections. For the 2013 elections we assign the average expert evaluations from the 2014 Chapel Hill Expert Survey for the five major parties, and predict the party positions for the 2017 manifestos.

corp_ger <- download(url = "https://www.dropbox.com/s/uysdoep4unfz3zp/data_corpus_germanifestos.rds?dl=1")
summary(corp_ger)
## Corpus consisting of 12 documents, showing 12 documents:
## 
##          Text Types Tokens Sentences year   party ref_score
##      AfD 2013   450    944        43 2013     AfD        NA
##  CDU-CSU 2013  7615  46535      2527 2013 CDU-CSU      5.92
##      FDP 2013  7953  42298      2375 2013     FDP      6.53
##   Gruene 2013 13839  93595      5126 2013  Gruene      3.61
##    Linke 2013  8451  43382      1850 2013   Linke      1.23
##      SPD 2013  8360  47348      2532 2013     SPD      3.76
##      AfD 2017  5947  18754       715 2017     AfD        NA
##  CDU-CSU 2017  4890  21510      1256 2017 CDU-CSU        NA
##      FDP 2017  8676  37609      1925 2017     FDP        NA
##   Gruene 2017 13353  72645      3220 2017  Gruene        NA
##    Linke 2017 11830  65728      2755 2017   Linke        NA
##      SPD 2017  8400  41938      2401 2017     SPD        NA
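
The ref_score column determines the role of each document: manifestos with a score serve as reference texts, while NA marks a virgin text. The following base-R sketch rebuilds that column from the summary above (the data frame name manifestos is our own) and labels each document accordingly. Note that the AfD 2013 manifesto has no expert score and therefore counts as a virgin text as well.

```r
# Document names and reference scores copied from the corpus summary above
manifestos <- data.frame(
  doc = c("AfD 2013", "CDU-CSU 2013", "FDP 2013", "Gruene 2013",
          "Linke 2013", "SPD 2013", "AfD 2017", "CDU-CSU 2017",
          "FDP 2017", "Gruene 2017", "Linke 2017", "SPD 2017"),
  ref_score = c(NA, 5.92, 6.53, 3.61, 1.23, 3.76, rep(NA, 6))
)

# A document with a score is a reference text; NA marks a virgin text
manifestos$role <- ifelse(is.na(manifestos$ref_score), "virgin", "reference")
reference_docs <- manifestos$doc[manifestos$role == "reference"]

print(manifestos)
print(reference_docs)
```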

Now we can apply the Wordscores algorithm to a document-feature matrix.

# create a document-feature matrix
dfmat_ger <- corp_ger %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords("de")) %>%
  tokens_tolower() %>%
  dfm()

# apply Wordscores algorithm to document-feature matrix
tmod_ws <- textmodel_wordscores(dfmat_ger, y = corp_ger$ref_score, smooth = 1)
summary(tmod_ws)
## 
## Call:
## textmodel_wordscores.dfm(x = dfmat_ger, y = corp_ger$ref_score, 
##     smooth = 1)
## 
## Reference Document Statistics:
##              score total min  max    mean median
## AfD 2013        NA   455   0   23 0.01105      0
## CDU-CSU 2013  5.92 22854   0  245 0.55495      0
## FDP 2013      6.53 20497   0  186 0.49772      0
## Gruene 2013   3.61 45244   0  398 1.09864      0
## Linke 2013    1.23 20794   0  234 0.50493      0
## SPD 2013      3.76 22928   0  214 0.55675      0
## AfD 2017        NA  9647   0  108 0.23425      0
## CDU-CSU 2017    NA 10624   0  136 0.25798      0
## FDP 2017        NA 19214   0  261 0.46656      0
## Gruene 2017     NA 40828   0 1086 0.99140      0
## Linke 2017      NA 33004   0  788 0.80142      0
## SPD 2017        NA 20688   0  186 0.50236      0
## 
## Wordscores:
## (showing first 30 elements)
##           alternative           deutschland          wahlprogramm 
##                 3.291                 4.740                 3.295 
##       währungspolitik               fordern             geordnete 
##                 4.529                 3.255                 4.240 
##             auflösung euro-währungsgebietes               braucht 
##                 3.336                 4.240                 4.153 
##                  euro               ländern               schadet 
##                 3.329                 4.227                 3.911 
##      wiedereinführung            nationaler             währungen 
##                 4.463                 4.577                 4.240 
##             schaffung             kleinerer            stabilerer 
##                 4.288                 4.425                 4.240 
##      währungsverbünde                    dm                  darf 
##                 4.240                 4.240                 3.870 
##                  tabu              änderung          europäischen 
##                 4.158                 4.226                 4.358 
##              verträge                 staat           ausscheiden 
##                 3.553                 4.791                 3.697 
##           ermöglichen                  volk          demokratisch 
##                 4.354                 4.240                 2.271
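
The call above sets smooth = 1, which adds one to the word counts before scoring. The sketch below illustrates the effect on a made-up example in base R. The helper toy_word_scores() is hypothetical and only approximates the idea; quanteda's internals may differ in detail. Without smoothing, a word that appears in only one reference text inherits exactly that text's score; with add-one smoothing, its score is pulled towards the other reference texts.

```r
# Hypothetical helper: word scores from a count matrix (rows = reference texts)
toy_word_scores <- function(counts, ref_scores, smooth = 0) {
  counts <- counts + smooth                      # add-k smoothing of the counts
  rel_freq <- counts / rowSums(counts)           # word shares within each text
  p_ref <- sweep(rel_freq, 2, colSums(rel_freq), "/")
  colSums(p_ref * ref_scores)                    # expected reference score per word
}

# Made-up counts: "tax" never occurs in the left-hand reference text
ref_counts <- matrix(c(0, 10,
                       5,  5),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("ref_left", "ref_right"),
                                     c("tax", "welfare")))
ref_scores <- c(ref_left = -1, ref_right = 1)

raw      <- toy_word_scores(ref_counts, ref_scores, smooth = 0)
smoothed <- toy_word_scores(ref_counts, ref_scores, smooth = 1)

print(rbind(raw, smoothed))
```

Smoothing also gives every word in the vocabulary a defined score, including words that occur only in virgin texts.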

Next, we predict the Wordscores for the unknown virgin texts.

pred_ws <- predict(tmod_ws, se.fit = TRUE, newdata = dfmat_ger)
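
With se.fit = TRUE, the prediction is a list holding the estimated scores (fit) and their standard errors (se.fit). The sketch below shows one way to turn such a list into approximate 95% confidence intervals. The object pred_like and its numbers are hypothetical stand-ins with the same two elements, used so that the example runs without the corpus; in practice you would use pred_ws instead.

```r
# Hypothetical stand-in for pred_ws: same structure, made-up numbers
pred_like <- list(
  fit    = c(doc_a = 3.90, doc_b = 4.30),
  se.fit = c(doc_a = 0.02, doc_b = 0.03)
)

# Normal-approximation 95% interval: estimate +/- z * standard error
z <- qnorm(0.975)
intervals <- data.frame(
  estimate = pred_like$fit,
  lower    = pred_like$fit - z * pred_like$se.fit,
  upper    = pred_like$fit + z * pred_like$se.fit
)

print(intervals)
```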

Finally, we can plot the fitted scaling model using the textplot_scale1d function from quanteda.textplots.

textplot_scale1d(pred_ws)