Towards a Dataset of Automatically Coded Protest Events from English-Language Newswires

Saturday, April 16, 2016
Assembly F (DoubleTree by Hilton Philadelphia Center City)
Peter Makarov, Institut für Computerlinguistik, University of Zurich
Jasmine Lorenzini, European University Institute
Advances in Natural Language Processing (NLP) allow social scientists to design projects that require extensive coding of textual data. In this paper, we present the design of a protest event dataset that we construct by automated coding of English-language newswire documents and that covers 30 European countries over 15 years. We aim to code a variety of protest action forms: demonstrations, strikes, riots, and boycotts, as well as symbolic protest. We discuss our data collection methodology and the main challenges that we face. For both the identification of relevant documents and the coding of event features, we rely on supervised NLP techniques. We describe the preparation of an extensive training set of manually annotated documents. Identifying protest events is challenging for human coders, let alone machines. We present our strategy for reconciling social scientists' key needs in terms of data reliability and validity with what current NLP technology has to offer. We test the reliability and validity of our event data by comparing it with existing datasets built through fully or semi-automated techniques and with manually coded protest event data on single countries. This methodological paper contributes to advancing interdisciplinary research by presenting a collaborative effort between computational linguists and social scientists.
Paper
  • Lorenzini_etal.pdf (520.3 kB)