A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia

Published on Apr 1, 2013
· DOI: 10.1007/978-3-642-35085-6_5
Oliver Ferschke (Technische Universität Darmstadt), Estimated H-index: 11,
Johannes Daxenberger (Technische Universität Darmstadt), Estimated H-index: 14,
Iryna Gurevych (Technische Universität Darmstadt), Estimated H-index: 66
With the rise of Web 2.0, participatory and collaborative content production have largely replaced traditional ways of information sharing and have created the novel genre of collaboratively constructed language resources. A vast untapped potential lies in the dynamic aspects of these resources, which cannot be unleashed with traditional methods designed for static corpora. In this chapter, we focus on Wikipedia as the most prominent instance of collaboratively constructed language resources. In particular, we discuss the significance of Wikipedia's revision history for applications in Natural Language Processing (NLP) and the unique prospects of the user discussions, a new resource that has just begun to be mined. While the body of research on processing Wikipedia's revision history is dominated by works that use the revision data as the basis for practical applications such as spelling correction or vandalism detection, most of the work focused on user discussions uses NLP for analyzing and understanding the data itself.
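Mining the revision history starts from pairs of adjacent article revisions. As a minimal sketch (not the tooling discussed in the chapter, which builds on dedicated Wikipedia revision APIs and full dumps), word-level edits between two plain-text revisions can be extracted with Python's standard difflib:

```python
import difflib

def extract_edits(old_rev: str, new_rev: str):
    """Return (op, old_span, new_span) tuples describing word-level
    differences between two revisions of an article text."""
    old_words = old_rev.split()
    new_words = new_rev.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / insert / delete spans
            edits.append((op, " ".join(old_words[i1:i2]),
                              " ".join(new_words[j1:j2])))
    return edits

# A spelling-correction edit shows up as a single word replacement:
edits = extract_edits("The cat sat on teh mat", "The cat sat on the mat")
```

Real pipelines additionally have to handle wiki markup, reverts, and revision metadata, but the diff step above is the conceptual core of turning a revision history into edit data.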
📖 Papers frequently viewed together
2008 IJCNLP: International Joint Conference on Natural Language Processing
2009 NAACL: North American Chapter of the Association for Computational Linguistics
#1 David N. Milne (University of Waikato), H-Index: 16
#2 Ian H. Witten (University of Waikato), H-Index: 84
The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a potential resource for natural language processing and many other research areas. This paper introduces the Wikipedia Miner toolkit, an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit cre...
#1 José Felipe Ortega Soto, H-Index: 1
Doctoral thesis defended at Universidad Rey Juan Carlos in March 2009. Thesis advisor: Jesus M. Gonzalez-Barahona.
Apr 23, 2012 in EACL (Conference of the European Chapter of the Association for Computational Linguistics)
#1 Torsten Zesch (Technische Universität Darmstadt), H-Index: 25
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated. Additionally...
Apr 23, 2012 in EACL (Conference of the European Chapter of the Association for Computational Linguistics)
#1 Oliver Ferschke, H-Index: 11
#2 Iryna Gurevych (Technische Universität Darmstadt), H-Index: 66
Last. Yevgen Chebotar (Technische Universität Darmstadt), H-Index: 18
view all 3 authors...
In this paper, we propose an annotation schema for the discourse analysis of Wikipedia Talk pages aimed at the coordination efforts for article improvement. We apply the annotation schema to a corpus of 100 Talk pages from the Simple English Wikipedia and make the resulting dataset freely available for download. Furthermore, we perform automatic dialog act classification on Wikipedia discussions and achieve an average F1-score of 0.82 with our classification pipeline.
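Dialog act classification assigns a discourse function to each turn of a discussion. The following toy heuristic is purely illustrative: the categories and cue words are invented stand-ins, not the annotation schema or the machine-learning pipeline used in the paper, which reaches an average F1-score of 0.82 with richer features.

```python
# Invented, coarse dialog-act cues for illustration only; a real system
# would learn lexical and contextual features from annotated Talk pages.
ACT_CUES = {
    "question": ("?", "why ", "how ", "what "),
    "proposal": ("should", "let's", "propose", "suggest"),
    "agreement": ("agree", "+1", "support"),
}

def tag_dialog_act(turn: str) -> str:
    """Assign the first matching coarse dialog act, else 'other'."""
    lowered = turn.lower()
    for act, cues in ACT_CUES.items():
        if any(cue in lowered for cue in cues):
            return act
    return "other"
```

A learned classifier replaces the hand-picked cue lists with features estimated from the annotated corpus, but the task shape (turn in, label out) is the same.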
#1 Sara Javanmardi (UCI: University of California, Irvine), H-Index: 9
#2 David W. McDonald (UW: University of Washington), H-Index: 37
Last. Cristina V. Lopes (UCI: University of California, Irvine), H-Index: 54
view all 3 authors...
User generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. In such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. The application of machine learning techniques holds promise for developing efficient online algorithms for better tools...
Aug 29, 2011 in DEXA (Database and Expert Systems Applications)
#1 Jingyu Han (NUPT: Nanjing University of Posts and Telecommunications), H-Index: 4
#2 Chuandong Wang (NUPT: Nanjing University of Posts and Telecommunications), H-Index: 1
Last. Dawei Jiang (NUS: National University of Singapore), H-Index: 11
view all 3 authors...
The collaborative efforts of users in social media services such as Wikipedia have led to an explosion in user-generated content, and how to automatically assess the quality of that content is now a pressing concern. Each article typically undergoes a series of revision phases, and articles of different quality classes exhibit specific revision cycle patterns. We propose to Assess Quality based on Revision History (AQRH) for a specific domain as follows. First, we borrow Hidden Markov M...
Jul 27, 2011 in EMNLP (Empirical Methods in Natural Language Processing)
#1 Kristian Woodsend (Edin.: University of Edinburgh), H-Index: 11
#2 Mirella Lapata (Edin.: University of Edinburgh), H-Index: 82
Text simplification aims to rewrite text into simpler versions, and thus make information accessible to a broader audience. Most previous work simplifies sentences using handcrafted rules aimed at splitting long sentences, or substitutes difficult words using a predefined dictionary. This paper presents a data-driven model based on quasi-synchronous grammar, a formalism that can naturally capture structural mismatches and complex rewrite operations. We describe how such a grammar can be induced ...
Jul 5, 2011 in ICWSM (International Conference on Weblogs and Social Media)
#1 David Laniado (Polytechnic University of Milan), H-Index: 17
#2 Riccardo Tasso (Polytechnic University of Milan), H-Index: 2
Last. Andreas Kaltenbrunner, H-Index: 23
view all 4 authors...
Talk pages play a fundamental role in Wikipedia as the place for discussion and communication. In this work we use the comments on these pages to extract and study three networks, corresponding to different kinds of interactions. We find evidence of a specific assortativity profile which differentiates article discussions from personal conversations. An analysis of the tree structure of the article talk pages makes it possible to capture patterns of interaction, and reveals structural differences among the...
#1 Alex Marin (UW: University of Washington), H-Index: 11
#2 Bin Zhang (UW: University of Washington), H-Index: 6
Last. Mari Ostendorf (UW: University of Washington), H-Index: 65
view all 3 authors...
This paper explores the problem of detecting sentence-level forum authority claims in online discussions. Using a maximum entropy model, we explore a variety of strategies for extracting lexical features in a sparse training scenario, comparing knowledge- and data-driven methods (and combinations). The augmentation of lexical features with parse context is also investigated. We find that certain markup features perform remarkably well alone, but are outperformed by data-driven selection of lexic...
#1 Emily M. Bender (UW: University of Washington), H-Index: 22
#2 Jonathan T. Morgan (UW: University of Washington), H-Index: 14
Last. Mari Ostendorf (UW: University of Washington), H-Index: 65
view all 8 authors...
We present the AAWD corpus, a collection of 365 discussions drawn from Wikipedia talk pages and annotated with labels capturing two kinds of social acts: alignment moves and authority claims. We describe these social acts and our annotation process, and analyze the resulting data set for interactions between participant status and social acts and between the social acts themselves.
Cited By: 12
Apr 1, 2021 in EACL (Conference of the European Chapter of the Association for Computational Linguistics)
#1 Alok Debnath (IIIT-H: International Institute of Information Technology, Hyderabad), H-Index: 2
#2 Michael Roth (University of Stuttgart), H-Index: 16
WikiHow is an open-domain repository of instructional articles for a variety of tasks, which can be revised by users. In this paper, we extract pairwise versions of an instruction before and after a revision was made. Starting from a noisy dataset of revision histories, we specifically extract and analyze edits that involve cases of vagueness in instructions. We further investigate the ability of a neural model to distinguish between two versions of an instruction in our data by adopting a pairw...
#1 Srikar Velichety (U of M: University of Memphis), H-Index: 3
#2 Sudha Ram, H-Index: 36
Last. Jesse Bockstedt, H-Index: 16
view all 3 authors...
We develop a method to assess the quality of peer-produced content in knowledge repositories using their development and coordination histories. We also develop a process to identify releva...
#1 Liang Yao (ZJU: Zhejiang University), H-Index: 13
#2 Yin Zhang (ZJU: Zhejiang University), H-Index: 15
Last. Yali Bian (ZJU: Zhejiang University), H-Index: 2
view all 7 authors...
We combine a topic model with Wikipedia knowledge. We represent a document as a bag of Wikipedia articles. We use Wikipedia page view statistics. We extract more accurate dynamic patterns with specific and coherent entities. We spend less time by using Wikipedia knowledge. Probabilistic topic models could be used to extract low-dimensional aspects from document collections, and capture how the aspects change over time. However, such models without any human knowledge often produce aspects that are not in...
#1 Lydia-Mai Ho-Dac, H-Index: 8
#2 Veronika Laippala (UTU: University of Turku), H-Index: 9
Last. Ludovic Tanguy, H-Index: 12
view all 4 authors...
Wikipedia is a popular and extremely useful resource for studies in both linguistics and natural language processing (Yano and Kang, 2008; Ferschke et al., 2013). This paper introduces a new language resource based on the French Wikipedia online discussion pages, the WikiTalk corpus. The publicly available corpus includes 160M words and 3M posts structured into 1M thematic sections and has been syntactically parsed with the Talismane toolkit (Urieli, 2013). In this paper, we present the first re...
Aug 10, 2015 in KDD (Knowledge Discovery and Data Mining)
#1 Srijan Kumar (UMD: University of Maryland, College Park), H-Index: 17
#2 Francesca Spezzano (UMD: University of Maryland, College Park), H-Index: 13
Last. V. S. Subrahmanian (UMD: University of Maryland, College Park), H-Index: 71
view all 3 authors...
We study the problem of detecting vandals on Wikipedia before any human or existing vandalism detection system flags them, so that such users can be presented early to Wikipedia administrators. We leverage multiple classical ML approaches, but develop 3 novel sets of features. Our Wikipedia Vandal Behavior (WVB) approach uses a novel set of user editing patterns as features to classify some users as vandals. Our Wikipedia Transition Probability Matrix (WTPM) approach uses a s...
#1 Fan Zhang (University of Pittsburgh), H-Index: 6
#2 Diane J. Litman (University of Pittsburgh), H-Index: 59
This paper explores the annotation and classification of students’ revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation is reliable with the schema and the annotated information helpful for revision analysis. Furthermore, features and methods are explored for the automatic classificatio...
Oct 1, 2013 in EMNLP (Empirical Methods in Natural Language Processing)
#1 Johannes Daxenberger (Technische Universität Darmstadt), H-Index: 14
#2 Iryna Gurevych (Technische Universität Darmstadt), H-Index: 66
In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to edits in a document. Our features are based on differences between two versions of a document including meta data, textual and language properties and markup. In a supervised machine learning experiment, we achieve a micro-averaged F1 score of .62 on a corpus of edits from the Englis...
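The micro-averaged F1 score reported above pools true positives, false positives, and false negatives across all edit categories before computing precision and recall. A minimal sketch of that metric (the feature set and classifier are the paper's own and are not reproduced here):

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for a multi-class labeling task: pool per-class
    true positives, false positives and false negatives, then compute F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1          # predicted this label, and it was correct
            elif p == label:
                fp += 1          # predicted this label, but it was wrong
            elif g == label:
                fn += 1          # missed an instance of this label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Note that for single-label multi-class classification (each edit gets exactly one category), micro-averaged F1 coincides with accuracy; the distinction from macro-averaging matters when category frequencies are skewed, as they are for edit types like vandalism versus spelling correction.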
May 28, 2013 in AI (Canadian Conference on Artificial Intelligence)
#1 Amir H. Razavi (U of O: University of Ottawa), H-Index: 8
#2 Diana Inkpen (U of O: University of Ottawa), H-Index: 32
Last. Lana Bogouslavski, H-Index: 1
view all 4 authors...
In this article, we present a novel document annotation method that can be applied on corpora containing short documents such as social media texts. The method applies Latent Dirichlet Allocation (LDA) on a corpus to initially infer some topical word clusters. Each document is assigned one or more topic clusters automatically. Further document annotation is done through a projection of the topics extracted and assigned by LDA into a set of generic categories. The translation from the topical clu...
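The projection step described above maps a topical word cluster inferred by LDA onto a generic category. A hypothetical sketch of that mapping by seed-word overlap (the category names and seed words below are invented for illustration; the paper's translation from topical clusters to categories is more involved):

```python
# Invented generic categories with seed-word sets, for illustration only.
CATEGORIES = {
    "sports":   {"game", "team", "score", "player"},
    "politics": {"election", "vote", "party", "policy"},
}

def project_topic(topic_words):
    """Map an LDA-style topic (a list of top words) to the generic
    category whose seed set it overlaps most (ties resolved by dict order)."""
    overlaps = {cat: len(seeds & set(topic_words))
                for cat, seeds in CATEGORIES.items()}
    return max(overlaps, key=overlaps.get)
```

Running LDA itself (to obtain the topical word clusters this function consumes) would typically use a library such as gensim or scikit-learn rather than a from-scratch implementation.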