Ms Kim Doyle1, Dr Mitchell Harrop1
1University of Melbourne, Melbourne, Australia
Large-scale datasets of communication are challenging traditional, human-driven approaches to content analysis in media and communications research (Lewis et al. 2013, p.34). With online journalism and social media producing huge amounts of digital content daily, media and communication scholars face the new challenge of describing and analysing this wealth of information (Günther & Quandt 2016, p.75). The cost of this research phase can impose limitations on sample sizes, and therefore on the kinds of research questions that can be addressed (Flaounas et al. 2013, p.102). Boumans and Trilling (2016, p.8) insist that ‘the sheer amount of data and the unique features of digital content call for the application of valuable new techniques.’ Yet, despite the unfolding opportunities, computational methods are not currently commonplace in digital journalism research (Boumans & Trilling 2016, p.8). Perhaps this is due to the ‘fluid and ephemeral’ (Karlsson & Sjøvaag 2016, p.179) nature of online journalism, which is currently ‘more akin to a flowing river that Web scrapers or algorithms can step into at a fixed point in time’ (ibid, p.186). In this presentation we will argue that methodologies from computer science and computational linguistics have the potential to help disclose patterns, consistencies and organisational factors of online news production that would otherwise remain undiscovered with traditional methods alone.
This presentation is a case study of an interdisciplinary project in the field of Media Studies, which forms part of the first author’s ongoing doctoral research at The University of Melbourne. The research project is designed to explore large-scale data mining of multiple news websites, utilising the Natural Language Toolkit as an analytical tool. Such a broad approach is supported by Karlsson and Sjøvaag, who argue that for reasons of methodological veracity, ‘researchers should ideally command both the access, collection and storage of digital news data’ (Karlsson & Sjøvaag 2016, p.187).
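As a minimal sketch of the kind of analysis the Natural Language Toolkit supports, the following counts word frequencies across article texts. It is a simplified stand-in using only the Python standard library; the article texts, stop-word list and tokeniser are illustrative assumptions, not the project’s actual pipeline (in practice NLTK supplies richer tokenisers and stop-word corpora):

```python
import re
from collections import Counter

# Hypothetical scraped article texts; the real project mines
# multiple news websites (the content here is invented).
articles = [
    "Markets rallied today as investors welcomed the budget.",
    "The budget announcement saw markets climb for a second day.",
]

# A tiny illustrative stop-word list; NLTK ships full stop-word
# corpora (nltk.corpus.stopwords) that would be used in practice.
STOP_WORDS = {"the", "a", "as", "for", "saw", "today"}

def tokenise(text):
    """Lower-case and split on letter runs -- a crude stand-in
    for an NLTK tokeniser such as nltk.word_tokenize."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

freq = Counter()
for article in articles:
    freq.update(tokenise(article))

print(freq.most_common(3))
```

Even this toy pipeline shows the shape of the method: once tokenisation and counting are automated, scaling from two articles to two million is an engineering problem rather than a coding-by-hand problem.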
Tailoring code and data acquisition is often costly and/or time-consuming; therefore ‘it is imperative that researchers share their computational solutions with other researchers’ (ibid, p.187). As such, the research project may develop a platform that could prove relevant and reusable for future Digital Humanities researchers. Yet reuse opportunities are not sufficient for quality research. Karlsson and Sjøvaag urge researchers ‘to publish more widely on the process of inductive method design rather than merely disseminating results from the analysis’ (Karlsson & Sjøvaag 2016, p.187). Moving towards standards for large-scale analysis of communications data therefore requires greater attention to the processes of preparing and conducting digital data collection and analysis.
The project addresses these dual concerns by constructing a platform using Apache Spark and the Zeppelin Notebook to inductively scrape, analyse and display news data using multiple programming languages within the same user interface. This builds on the Digital Humanities tradition of ‘building things’ as additional research outputs, but goes further by creating a flexible, scalable and open-source platform for next-generation Digital Humanities scholars. This solution was researcher-driven, with the indispensable support of informatics and research services at The University of Melbourne. The platform is being built in collaboration with the Social and Cultural Informatics Platform (SCIP) and Research Platforms, often through a process of mutual learning. The SCIP digital humanities initiative has previously been described at eResearch Australasia (see Neish et al. 2015).
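By way of illustration only, the extraction step of such a scraping pipeline might resemble the following standard-library sketch. The HTML structure, tag name and headline class are invented assumptions, since every news site marks up its pages differently; in the actual platform this kind of per-page logic runs at scale inside Apache Spark jobs authored in a Zeppelin notebook, not in plain Python:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect text from <h2 class="headline"> elements.
    The tag and class here are hypothetical examples only."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self._in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline and data.strip():
            self.headlines.append(data.strip())

# A stand-in for a fetched page; a real scraper would download
# pages (e.g. with urllib) and a Spark job would process many
# such pages in parallel.
sample_page = """
<html><body>
  <h2 class="headline">Budget passes the Senate</h2>
  <p>Story text...</p>
  <h2 class="headline">Markets respond to announcement</h2>
</body></html>
"""

parser = HeadlineParser()
parser.feed(sample_page)
print(parser.headlines)
```

Keeping the extraction logic this explicit is part of the point: publishing the process, not just the results, is precisely what Karlsson and Sjøvaag call for.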
This talk will discuss initial problems encountered during the ongoing research process. For example, when building tools and platforms for Digital Humanities, it is important to recognise that the humanities have their own methods, ‘not based in calculation, automation, or statistical probability, but in ambiguity, interpretation, and in embodied and situated models of knowledge and knowing’ (Burdick et al. 2012, p.92). Thus, a Humanities and Social Sciences (HASS) platform must be flexible enough to accommodate mixed methodologies, as well as mixed skill levels and skill specialisations.
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T. & Schnapp, J. (2012). Digital_Humanities. MIT Press.
Boumans, J.W. & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), pp.8–23.
Flaounas, I., Ali, O., Lansdall-Welfare, T., De Bie, T., Mosdell, N., Lewis, J. & Cristianini, N. (2013). Research methods in the age of digital journalism: Massive-scale automated analysis of news-content—topics, style and gender. Digital Journalism, 1(1), pp.102–116.
Günther, E. & Quandt, T. (2016). Word counts and topic models: Automated text analysis methods for digital journalism research. Digital Journalism, 4(1), pp.75–88.
Karlsson, M. & Sjøvaag, H. (2016). Content analysis and online news: Epistemologies of analysing the ephemeral Web. Digital Journalism, 4(1), pp.177–192.
Lewis, S.C., Zamith, R. & Hermida, A. (2013). Content analysis in an era of big data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), pp.34–52.
Neish, P., Murray, A. & Konstantelos, L. (2015). ‘The role of research data repositories in social and cultural informatics and the wider open data ecosystem’, eResearch Australasia Conference, Brisbane, Australia, 19–23 October 2015. Available: https://eresearchau.files.wordpress.com/2015/07/eresau2015_submission_45.pdf
Dr Mitchell Harrop
Mitchell is a Humanities and Social Sciences Informatics Specialist in the Social and Cultural Informatics Platform (SCIP) at The University of Melbourne. Mitchell’s PhD research involved ethnographic studies of digital game playing and was conducted within Melbourne University’s Computing and Information Systems department. He has lectured in Informatics, Database Systems and Web Information Technologies.