Taming our Wild Data: On Intercoder Reliability in Discourse Research

Renske van Enschot; Wilbert Spooren; Antal van den Bosch; Christian Burgers; Liesbeth Degand; Jacqueline Evers-Vermeul; Florian Kunneman; Christine Liebrecht; Yvette Linders; Alfons Maes

doi:10.51751/dujal16248

Author(s)

Renske van Enschot, Tilburg University, Department of Communication and Cognition, https://orcid.org/0000-0003-0692-4231
Wilbert Spooren, Centre for Language Studies, Radboud University, https://orcid.org/0000-0002-2982-3970
Antal van den Bosch, Institute for Language Sciences, Utrecht University, https://orcid.org/0000-0003-2493-656X
Christian Burgers, Amsterdam School of Communication Research (ASCoR), University of Amsterdam, https://orcid.org/0000-0002-5652-9021
Liesbeth Degand, Institute for Language and Communication, University of Louvain, https://orcid.org/0000-0003-1062-9243
Jacqueline Evers-Vermeul, Institute for Language Sciences, Utrecht University
Florian Kunneman, Dept. Computer Science, Social AI, VU University Amsterdam, https://orcid.org/0000-0002-1932-3200
Christine Liebrecht, Tilburg center for Cognition and Communication, Tilburg University, https://orcid.org/0000-0002-6621-2212
Yvette Linders, Centre for Language Studies, Radboud University
Alfons Maes, Tilburg center for Cognition and Communication, Tilburg University, https://orcid.org/0000-0003-0970-7363

Keywords:

discourse, quantitative content analysis, complex discourse data, hands-on procedures, intercoder reliability

Abstract

Many research questions in the field of applied linguistics are answered by manually analyzing data collections or corpora: collections of spoken, written and/or visual communicative messages. In this kind of quantitative content analysis, the coding of subjective language data often leads to disagreement among raters. In this paper, we discuss causes of and solutions to disagreement problems in the analysis of discourse. We discuss crucial factors determining the quality and outcome of corpus analyses, and focus on the sometimes tense relation between reliability and validity. We evaluate formal assessments of intercoder reliability. We suggest a number of ways to improve intercoder reliability, such as precisely specifying the variables and their coding categories and dividing the coding process into smaller substeps. The paper ends with a reflection on challenges for future work in discourse analysis, with special attention to big data and multimodal discourse.
