A case study in online journalism: preliminary findings and unfolding digital humanities methods

Ms Kim Doyle1, Dr Mitchell Harrop1

1University of Melbourne, Melbourne, Australia


Large-scale datasets of communication are challenging traditional, human-driven approaches to content analysis in media and communications research (Lewis et al. 2013, p.34). With online journalism and social media producing huge amounts of digital content daily, media and communication scholars face the new challenge of describing and analysing this wealth of information (Günther & Quandt 2016, p.75). The cost of such human-driven analysis can limit sample sizes, and therefore the kinds of research questions that can be addressed (Flaounas et al. 2013, p.102). Boumans and Trilling (2016, p.8) insist that ‘the sheer amount of data and the unique features of digital content call for the application of valuable new techniques.’ Yet, despite the unfolding opportunities, computational methods are not currently commonplace in digital journalism research (Boumans & Trilling 2016, p.8). Perhaps this is due to the ‘fluid and ephemeral’ (Karlsson & Sjøvaag 2016, p.179) nature of online journalism, which is currently ‘more akin to a flowing river that Web scrapers or algorithms can step into at a fixed point in time’ (ibid, p.186). In the presentation we will argue that methodologies from computer science and computational linguistics have the potential to disclose patterns, consistencies and organisational factors of online news production that would remain undiscovered with traditional methods alone.

This presentation is a case study of an interdisciplinary project in the field of Media Studies, which forms part of the first author’s ongoing doctoral research at The University of Melbourne. The research project is designed to explore large-scale data mining of multiple news websites and the use of the Natural Language Toolkit as an analytical tool. Such a broad approach is supported by Karlsson and Sjøvaag, who argue that for reasons of methodological veracity, ‘researchers should ideally command both the access, collection and storage of digital news data’ (Karlsson & Sjøvaag 2016, p.187).

Tailoring code and data acquisition is often costly and/or time consuming; therefore ‘it is imperative that researchers share their computational solutions with other researchers’ (ibid, p.187). As such, the research project may develop a platform that is relevant to, and reusable by, future Digital Humanities researchers. Yet reuse opportunities alone are not sufficient for quality research. Karlsson and Sjøvaag urge researchers ‘to publish more widely on the process of inductive method design rather than merely disseminating results from the analysis’ (Karlsson & Sjøvaag 2016, p.187). Moving towards standards for large-scale analysis of communications data therefore requires greater attention to the processes of preparing and conducting digital data collection and analysis.

The project addresses these dual concerns by constructing a platform using Apache Spark and the Zeppelin Notebook to inductively scrape, analyse and display news data using multiple programming languages within the same user interface. This builds on the Digital Humanities tradition of ‘building things’ as additional research outputs, but goes further by creating a flexible, scalable and open-source platform for next-generation Digital Humanities scholars. This solution was researcher-driven, with the indispensable support of informatics and research services at The University of Melbourne. The platform is being built in collaboration with the Social and Cultural Informatics Platform (SCIP) and Research Platforms, often through a process of mutual learning. The SCIP digital humanities initiative has previously been described at eResearch Australasia (see Neish et al., 2015).

This talk will discuss initial problems encountered during the ongoing research process. For example, when building tools and platforms for Digital Humanities, it is important to recognise that the humanities have their own methods ‘not based in calculation, automation, or statistical probability, but in ambiguity, interpretation, and in embodied and situated models of knowledge and knowing’ (Burdick et al. 2012, p.92). Thus, a Humanities and Social Sciences (HASS) platform must be flexible enough to accommodate mixed methodologies, as well as mixed skill levels and skill specialisations.


Ms Kim Doyle
Kim is a PhD candidate in Media and Communications at the University of Melbourne. Her thesis applies computational linguistics to the discourse analysis of online news articles. She teaches digital media research and natural language processing at Research Platforms at the University of Melbourne.

Dr Mitchell Harrop
Mitchell is a Humanities and Social Sciences Informatics Specialist in the Social and Cultural Informatics Platform (SCIP) at The University of Melbourne. Mitchell’s PhD research involved ethnographic studies of digital game playing and was conducted within Melbourne University’s Computing and Information Systems department. He has lectured in Informatics, Database Systems and Web Information Technologies.

