Research Impact

Paper Miner: Big Questions in History

What if historians could mine newspaper articles going back to the early days of Australian colonial history, and what if each of those articles were georeferenced – the content of each story pinpointed in time and space? Paper Miner is taking the first steps towards making that possible.


Like most research projects, it started with a conversation and a question. Well, two conversations really. The first took place about a year ago between Paul Turnbull, Professor of eHistory at The University of Queensland, and Professor Kerry Raymond from the Faculty of Information Technology at QUT. Kerry mentioned to Paul that she had a copy of the textual data the National Library of Australia had created to enable searching of its online collection of newspapers, which runs from the early days of Australian colonisation up until the 1950s, when copyright restrictions come into play.

The second conversation was between Paul and fellow historians Dr Jonathan Richards and Professor Clive Moore. They wondered if, using this data, it might be possible to map the scale and spread of violence on the Queensland frontier during the 19th century. The extent of violence during European colonisation has been the subject of vitriolic public debate over the interpretation of history in Australia's "history wars." Newspaper coverage at the time, coupled with data mining and visualisation techniques, might allow them to build up a picture of what was coming to public attention at given times and locations. And then the more the three of them talked about it, the more they realised the potential for exploring other questions, such as how people have responded in the past to extreme weather events like cyclones and monsoonal flooding. That's how the idea for Paper Miner was born.

"We had a couple of questions which really interested us," Paul says, "but as we talked we began to appreciate the immense benefits that researchers, students, indeed anyone interested in our history, could gain by our capitalising on the National Library doing this remarkable thing of digitising our national newspaper record. We began to imagine a freely available service enhancing the value of this achievement by exploiting eResearch tools, techniques and infrastructure to visually explore spatially located connections and patterns in historical phenomena over time. We've seen new knowledge gained by analysing incredibly large amounts of data – by astronomers, for example. And we started to think that this was something we could do, given that the technology was now available to historians, and the eResearch infrastructure and support existed which made it all quite feasible."

The team from UQ, QUT, Griffith University, and the Smart Services CRC set to work to build it. "We got a long way pretty quickly," says Paul. "This was due to working with QUT researchers Sangeetha Kutty, Richi Nayak, Ron Chernich and Richard Thomas – all incredibly smart people in the fields of computer and information science. They worked in a very dedicated way on the project. We had a small amount of money. We had great support from Gavan Kennedy and the Smart Services CRC, who were absolutely terrific in seeing the potential of a project like this."

It wasn't long into the project before they realised they would need more storage. The original text files from the National Library were around 400 gigabytes, but showing the newspaper content in time and space required the creation of around 77 million index files, swelling the data to 4 terabytes. Paul had worked with QCIF's eResearch Analyst team at UQ on a previous project. "The obvious, logical thing was to ask QCIF – Are you able to help? And the response was immediately – Yes we are. That was good news," Paul says. The Paper Miner data is stored on QRIScloud, operated by QCIF.

Paul has been using digital technologies in history teaching and research since the mid-1990s. He believes digitisation and technology infrastructure are opening up new avenues for research in the humanities. "We're now beginning to ask questions about big issues which are confronting us, such as the historical and cultural dimensions of climate change. The historical record is something we should be using to understand the true complexity of phenomena such as human resilience and adaptation to changing material conditions of life over time. In short, we are beginning to ask questions of what are potentially very large data sets, which will see us as much involved in eResearch as anyone in the sciences. In fact we are already beginning to sense affinities between bioinformatics and its interest in understanding biological complexity, and the kinds of social and cultural complexity which are engaging a growing number of researchers in the humanities. I suspect that in 10-15 years the humanities will be the greatest users of eResearch infrastructure in Australia."