The Trouble with Text Mining: And why some projects take a long time, and future projects might take less time

by royalhistsoc | Mar 14, 2023 | General, Guest Posts, Historical Research in the Digital Age

In this post we conclude our series — ‘Historical Research in the Digital Age’ — which explores six historians’ use and understanding of the digital tools and sources that shape modern research culture. The series investigates the impact and implications of digital resources (positive and negative) for how historians work today.

In Part Six we hear from Jo Guldi who is Professor of History at Southern Methodist University in the US. In this extended article, Jo reflects on her long experience of working with digital sources and tools as a historian, with particular focus on the opportunities and challenges inherent within text mining for historical research.

As Jo shows through the development of her own research and publications, historians’ engagement with digital practice remains contingent on developments in computing and data science. Only now are these being worked into a coherent methodology for digital history. As historians, Jo argues, we need to remain open to changes of direction, prompted by digital innovation, while also remaining grounded in the physical archive that digital may enhance but not replace.

The ‘Historical Research in the Digital Age’ series is hosted by Ian Milligan from the University of Waterloo, Ontario, whose new book, The Transformation of Historical Research in the Digital Age, is available as a free Open Access download from Cambridge University Press. All six contributions to the Royal Historical Society series are now available.

Introduction and Context

The technology seemed to be in place. Biologists and computer scientists assured me that topic modelling was revolutionizing studies in their fields. I had listened to the digital humanists, who assured me the technology was ready for the analysis of longitudinal historical data.

This was back in 2008-10, when I was planning a project on the global history of landownership which I published in 2022 as The Long Land War. The Global Struggle for Occupancy Rights. At the outset, I was persuaded by the recent launch of Google Books (in late 2004) that new methods of text mining could address problems of historical analysis. One of my first articles had used word count in Google Books to study how strangers moved down the street in nineteenth-century London. Could not similar techniques be applied to the study of property?

I put in place the tactics recommended to me at the time: investing in a high-quality camera, tripod, automatic shutter release, and high-powered lights, costing about $1200; taking this setup into the archives; and photographing thousands of pages of text. I trained software (another c. $750) on the resulting images to recognize the text. I bumbled along this way for months, none of which were lost. The entire time in the archive, I was also reading and taking notes in the traditional way. But when I fed the images to a computer, the results were nonsense.

Neither the images nor the software in my setup was good enough. For instance, here is the transcription of what my software was able to extract from a typewritten letter sent by John Lossing Buck to Rainer Schickele, from the archives of the United Nations’ Food and Agriculture Organization (FAO):

D681’ lIr. Schickele,

In ‘IIrf hute to n-i:be TR on J1I118 20 SCI that. the letwr would bs travelling over the week~’.. I n”,lected. to s<Q’ that Orade F-4 begins at $7)00 with rmnual 1ncrementi 0:£ $225 to #2,0 \\1P to ‘9500.

In the outlina of the ch.al’t at: the Laad and Water Use !lranch U’CI~r the subject “Appraisal of Lmd and Water Reso1lrces” the vacant POI!t about 1I’h1ch I am in correap~ n th yell, hes the title “Resource Appraisal” • In general, the Brl/lllch uOI!’gardzed on the bas:l.B of _ man havinc O”nll’5ight of work dme in … ~ti.ltlar general field of activiV. In .501118 eafil., there nwy be a variatl.on Il_ tosped.al reaaolUl. ! hall hoped. the Appraisal of Land and W..t~ iIt~.. cll\\l.ld be org&n!zed :IJl the . _ Wlq and the pusan :l.’:U11ng the vaoant 1’011’\\ umier that heading WI)ul.d. be the logical ons,to have c:enerel ov”rqI~ht in tld.E activity.

In the parlance of text mining, these are ‘dirty’ results. While a few recognizable words are obvious, computational identification of phrases would probably fail to locate the name of Schickele’s division at the FAO—the Land and Water Use Branch—here transcribed as ‘the Laad and Water Use !lranch.’ The general subject-matter of the letter, the ‘appraisal’ of land values, can be ascertained, but the actual reason for the letter—the original offer of employment to Schickele—is impossible to determine.
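
To make the problem concrete, here is a minimal sketch of why exact phrase search fails on dirty text while approximate matching can sometimes recover the phrase. It uses Python's standard-library difflib; the excerpt is taken from the transcription above, and the function name and window approach are my own illustration, not the software used in the project.

```python
from difflib import SequenceMatcher

# Excerpt from the OCR output quoted above
ocr_text = "at: the Laad and Water Use !lranch"
target = "Land and Water Use Branch"

def best_fuzzy_match(phrase, text):
    """Slide a window the length of the phrase (in words) over the text
    and return the (similarity, window) pair with the highest similarity."""
    words = text.split()
    n = len(phrase.split())
    best = (0.0, "")
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
        if score > best[0]:
            best = (score, window)
    return best

exact_hit = target in ocr_text          # exact search finds nothing
score, window = best_fuzzy_match(target, ocr_text)
print(exact_hit, round(score, 2), window)
```

Approximate matching of this kind only helps when the researcher already knows which phrase to look for; it does not repair the transcription itself.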

Though these initial results were disappointing, this had not been wasted time. The images I’d created of the documents were highly legible. I had read many of them and taken notes while in the archive. Now, like many researchers in the field, I had a digital transcription of documents I’d read hastily, and I would be able to return to these archival transcriptions in the years to come as I began to write about the contents of the archive. This was an important early lesson: working digitally and away from the archive goes hand-in-hand with traditional archival work undertaken on site.

There were better results from archives where I had access to regular scanning equipment, for instance at the University of Sussex or the papers of Participatory Research in Asia, a nonprofit headquartered in New Delhi. I fed these reports through a geoparser algorithm that was part of the ‘Paper Machines’ pipeline designed by Cora Johnson-Roberson and myself for working with Zotero. This algorithm was trained to recognize place names, enabling me to create a heatmap for the participatory research movement’s global activities.
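
As a rough illustration of the idea behind geoparsing, the sketch below counts mentions of place names drawn from a small hand-made list. Everything in it is hypothetical: the gazetteer, the sample reports, and the function name are invented for the example, and it should not be read as the algorithm used in Paper Machines, which I am not reproducing here.

```python
import re
from collections import Counter

# Hypothetical gazetteer and sample texts, invented for illustration
GAZETTEER = {"India", "Kenya", "Brazil", "United Kingdom"}

reports = [
    "Workshops in India and Kenya trained local mappers.",
    "A follow-up meeting in India drew researchers from Brazil.",
]

def count_places(texts, gazetteer):
    """Count whole-word mentions of each gazetteer entry across the texts."""
    counts = Counter()
    for text in texts:
        for place in gazetteer:
            pattern = r"\b" + re.escape(place) + r"\b"
            counts[place] += len(re.findall(pattern, text))
    return counts

place_counts = count_places(reports, GAZETTEER)
print(place_counts.most_common())
```

Counts like these, joined to coordinates, are the kind of input a heatmap needs. A real geoparser also has to disambiguate names (Cambridge in England or Massachusetts), which a simple lookup cannot do.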

This image never became part of the manuscript of my book, The Long Land War, although in important ways it supported the research that followed. I had begun already to glimpse how globally interconnected the participatory mapping movements of the 1980s and 1990s were. The heat map underscored for me that conversations about participatory research and mapping were present in virtually every nation in the developing world, alongside a nexus of academic researchers primarily located in the United States, UK and India. Eventually, I merely described these networks, naming nations, individuals, and important documents. But the map—like many of the first data-driven visualizations I made in the Paper Machines era—powerfully steered my sense of what was in the archive, giving me the confidence to draw big conclusions about the archive.

There were dozens of analytics of this kind. I counted mentions of ‘land reform’ and ‘agrarian reform’ and related keywords across the social science disciplines on JSTOR. I counted allusions to ‘landlords’ and ‘tenants’. I used topic modeling to index how ideas about property were changing. Whenever I heard about a new kind of textual analysis, I read again.
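
A keyword count of this kind can be sketched in a few lines. The records below are invented stand-ins for dated articles (they are not JSTOR data), and the function name is hypothetical; the point is only to show phrases being tallied by decade.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (year, text) records standing in for a real corpus
docs = [
    (1952, "Land reform was debated alongside agrarian reform in the assembly."),
    (1958, "Tenants petitioned against landlords; land reform stalled."),
    (1971, "Agrarian reform returned to the agenda as tenants organised."),
]
phrases = ["land reform", "agrarian reform", "landlords", "tenants"]

def counts_by_decade(records, phrases):
    """Tally whole-phrase matches per decade, ignoring case."""
    table = defaultdict(Counter)
    for year, text in records:
        decade = year // 10 * 10
        lowered = text.lower()
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            table[decade][phrase] += len(re.findall(pattern, lowered))
    return table

table = counts_by_decade(docs, phrases)
for decade in sorted(table):
    print(decade, dict(table[decade]))
```

Raw counts like these are sensitive to how much text survives from each decade, so in practice they are usually normalized by the total number of words per period before any comparison is made.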

But none of these searches, nor any of the visualizations generated, made their way into The Long Land War. In many ways, my path towards this monograph mirrored that of other modern researchers described by Ian Milligan in The Transformation of Historical Research in the Digital Age (2022).[1] Here Milligan traces the many ways that manuscripts, journal articles, and other sources—once accessed primarily via paper copies at the library—have been digitalized, creating new opportunities for researchers to make connections across global regions, genres of publication, and time periods.[2]

It’s Milligan’s contention—and I think he is persuasive—that the creation of new tools for reckoning with libraries and archives has transformed the practice of ‘traditional’ historical research for almost everyone in practice today, whether or not they consider themselves ‘digital historians’. This was certainly true in my experience—but I was also leaning into the new methods. Indeed, at the beginning of my project, I had expected that The Long Land War would mainly be forged atop the new tools of text mining.

But the methods of text mining that I have published about took a long time to develop. The rest of this article explains what happened when my initial forays into digital text mining ran aground, providing background and insight, but few publishable results. Let me be the first to confess that text-mining approaches took a long time to mature, longer than I originally expected. I can confess the delays that new forms of research cost me, and I am willing to detail the blind alleys into which research sent me.

How Historical Practices Mature

In 2010, when I began in earnest to research a book on the long-term history of landownership, and to imagine that digital methods might support my work, there were few manuals available about how to proceed. I had to survey the different available analytics and read up on how scholars in literature and political science were applying textual analysis to the longue durée. There wasn’t a clear case outlining which of those studies were persuasive and which needed more refinement to meet the standards of history. I had to make up my own mind.

One of the issues clearly emerging in this literature was a renewed concern with data quality. Elsewhere in the digital humanities, other scholars were also reckoning with issues of quality and what historical research required. Where researchers were creating new data from scratch, often this process consumed years of work in the context of highly collaborative initiatives with computational researchers. Years of data-cleaning were required before analysis could be performed.

While I was researching The Long Land War, Ryan Cordell was working alongside computer scientists who process visual information for his study of the Library of Congress’s newspaper collection—one that’s notoriously difficult for computers to read. Joris van Eijnatten was documenting the challenges that face a reinvigorated, data-driven global history, based on his observations of working with the Dutch newspapers.[3] Facing similar problems at the British Library, Ruth Ahnert and her group have done an enormous amount of work to document the need for such permanent investments in national libraries and universities as a support to scholarship.[4] And indeed enormous successes have accrued to teams at the University of Helsinki who applied genome-detection algorithms to the computer data.

Today, teams at the universities of Princeton, Vanderbilt, Lancaster, Waterloo, Virginia, Richmond, Berkeley, and Johns Hopkins are pioneering such work on behalf of historians and historically-oriented researchers. Yet for many of the major newspaper collections of interest to historians, data quality still lags behind the standard required for trustworthy research. The methodology has advanced, but it has not yet affected historical research.

Meanwhile, other scholars were asking questions about how and why reliable research mattered. Ryan Cordell turned the messy transcription of Edgar Allan Poe into a witty commentary on the multiple streams of possible misrepresentation generated by insufficient computational processes.[5] But for me as a historian, reckoning not with manifold meanings but with historical fact, the situation was less an inspiration to wit than discomfort. I could hardly trust counts of words extracted from document transcription so poor in quality.

Today, researchers who are beginning a digital project can also consult a variety of scholarly publications that lay out with clarity the difficulties that a scholar will face if they choose to digitalize original documents.[6] It now seems probable to me that, had a Digital Humanities faculty been in place at the time at any of the universities where I did my research, they would have cautioned me about my ambitions from the beginning.

When I began working, however, most of these supports were still in the future. Working in an underdeveloped field cost me time in the form of heading down blind alleys that would not materialize into the kind of project I had originally imagined.

But the advances in the digitalization of archives and in pedagogy are only two parts of a much broader set of developments. Equally important is the arrival of a current toolkit of text mining applied to questions of history—that is, the process of applying statistics and machine learning to make discoveries about event, period, memory, and other building-blocks of historical knowledge. Even while I bumbled through questions about the digitalization of archives, lacking pedagogical focus, I had an opportunity to contribute to the subfield of digital history that is text mining for historical research.

Text Mining for Historical Research

Text mining for historians took a long time to develop as a field, although it is now making great strides. Over the past three years, digital history has come to boast three journals, and there are dozens of journal articles in the field of history.[7] In 2020, Luke Blaxill became the first author to publish a historical monograph, The War of Words, that used text mining as its principal methodology.[6] My own approaches are forthcoming in a book from Cambridge in summer 2023, The Dangerous Art of Text Mining. A Methodology for Digital History, which is itself an outcome of my research experience of succeeding and failing with digital text analysis.

Because of the debates that followed the publication of my previous book The History Manifesto (with David Armitage, 2014), I was beginning to understand that the humanities wouldn’t be satisfied by partial results. If I produced a beautiful digital archive, but couldn’t analyze it accurately with mastery of the algorithms, my results would be dismissed. Interesting results from text mining would make little impact unless a more robust analysis of historical change and new methods could be theorized.

My early experience attempting to create a digitalized archive for The Long Land War enabled me to contemplate existential issues around the future of digital methods in history departments. Was there any future at all for a historian using text mining as a method? I did not principally want to make my own mark in digitalizing an archive unless I knew with certainty that the tools of digital analysis would bear fruit once applied. I realized that I had gotten ahead of myself. Assembling a new database of clean text was one issue; developing a rigorous practice of algorithmic analysis was another. With this concession in mind, I took a new approach.

Having stumbled already, the question I urgently needed to answer was whether text mining itself would work for historical analysis in general. I had already published one article with good results from word counts of the Google Books corpus. New machine learning algorithms and statistical methods came recommended, and offered me a path for a nuanced distant reading of property. Early experimental seminars with graduate students working on the JSTOR data of twentieth-century social science publications produced modest outcomes. But I had nothing so solid as a clear methodology, that is, an understanding of which algorithm applied to data would fit which theoretical aspect of historical analysis.

I also realized that it was important to produce solid proof of concept in the form of a rigorous series of analyses applied to a single digitalized archive where property was debated. Fortunately, one such archive had already been digitalized by previous digital history projects: debates held in the UK Parliament. Instead of digitalizing my own archive and producing a new data set, I would use a dataset produced by others.

The maturity of fields is an issue for scholars. If the field of digitalization practices is immature, the scholars most likely to contribute are those attached to major interdisciplinary research units. If the field of text mining is immature, solo researchers supported by interdisciplinary conversations may offer important contributions. However, they will likely have to experiment and theorize the standards of research for the field en route to preparing a monograph based primarily on the results of text mining.

My own digital research began to advance when I whittled down my ambitions about digital archives and expanded my ambitions about methods. I stepped back entirely from the problem of assembling an ideal archive suited to the longue-durée analysis of property. I turned towards the problem of the methods, theory, and first principles of text mining for historical analysis.

While I contemplated how difficult digitalization projects could be, many scholars were hard at work advancing practical approaches for gaining historical knowledge from clean datasets derived from successful digitalization projects of the past. When I decided to join them, my research on the history of property law took off in a new direction.

The Field of Text Mining and How it Grew

I eventually performed a great deal of analysis of the language of property in the nineteenth-century parliamentary debates, where issues like slavery, the price of rent, and the role of landlords in the national economy were treated at great length. I compiled case study after case study; glimpses of that work had appeared already in the methodological articles. I compiled the results into the manuscript for A Distant Reading of Property.

I was far from the only historian reasoning along these lines. The work of Ruth and Sebastian Ahnert, applying text mining to the Tudor State papers, demonstrated how computational tools could reveal hidden relationships in the archives.[7] Humanists like Richard Jean So, Andrew Piper, Ted Underwood, and Lauren Klein began writing history on the longue durée using digital tools.[8] Luke Blaxill performed an emblematic research process with secondary sources and digital tools for his War of Words, published as part of the Royal Historical Society’s Studies in History Series. Digital tools helped him to hold more facts in mind than traditional tools do. Blaxill was therefore able to generalize with accuracy and to refute many of the conclusions that other historians have offered about his period. But ‘distant reading’ alternated with ‘close reading,’ generalization with engagement with the historiography, through the entire process.[9]

Important theoretical engagements with historical time were offered by other historians. Notable here is the work of Dutch historian Pim Huijnen, who engaged theories of temporal change in relationship to the Dutch newspaper corpus; and that of political historians Joris van Eijnatten, Pasi Ihalainen and Ruben Ros, who’ve applied text mining to problems of historical conceptualization, again working mainly with Dutch newspapers.[10] Working with the political scientist Gregory Wawro, Ira Katznelson urged fellow historians towards other machine learning methods that allow development of nuanced maps of temporal change.[11] Together, these publications amount to a robust methodology for digital history, distinct from the more general concerns of the digital humanities, which is capable of supporting sustained and careful research.

I was a hungry consumer of these conversations, and a delighted colleague and interlocutor of many of these individuals. With a ready-to-go parliamentary dataset in hand, I found that wonderful support was at my fingertips. The universities where I was working had few opportunities available for digitalization of documents, although they had other facilities for digital analysis. Local institutional investments meant that I had access to teams of data scientists and specialists in high-performance computation, the area of expertise necessary to organize data analytics at scale. For this, I had to thank colleagues in the information sciences, forward-looking provosts and Chief Technology Officers on two campuses—developments that History department colleagues were typically unaware of. Reaching out across campus gave me the networks I needed to ask questions.

With institutional supports such as these, I could explore the analysis of the parliamentary debates to my heart’s content, running analysis after analysis on the data. My job was to read enough in the methodological literature to recommend a new algorithm. Colleagues in data science and computing implemented the analysis, and my job (as I conceived it at this stage) was to read the results and interpret them.

The data analysts were also incredibly useful for thinking about how to reconcile historical methods with digital ones. At my instruction, they produced indexes that made it possible to navigate between ‘distant reading’ of the corpus as a whole and a ‘close reading’ of particular passages of text, which could be treated like ordinary primary sources.
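
One simple way to picture such an index is as a lookup table from each word to the passages that contain it. The sketch below is an assumption-laden toy: the three speeches are invented, the identifiers and function names are hypothetical, and it is not the index the analysts built. It only shows how a single structure can answer a 'distant' question (in how many speeches does a word appear?) and a 'close' one (which passages should I now read?).

```python
import re
from collections import defaultdict

# Invented passages standing in for parliamentary speeches
speeches = {
    "s1": "The landlord raised the rent and the tenant could not pay.",
    "s2": "Parliament debated the rent question at length.",
    "s3": "The tenant appealed to the committee.",
}

def build_index(passages):
    """Map each lowercased word to the set of passage ids containing it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(pid)
    return index

def passages_for(word, index, passages):
    """Return the full text of every passage containing the word."""
    return {pid: passages[pid] for pid in sorted(index.get(word.lower(), ()))}

index = build_index(speeches)
rent_spread = len(index["rent"])                       # distant reading: a count
rent_passages = passages_for("Rent", index, speeches)  # close reading: the texts
print(rent_spread, list(rent_passages))
```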

At the outset, I relied exclusively on my institutions’ teams of data scientists to clean data and run it through analysis. As time went on, however, I began to understand that to satisfy the questions historians raised about method, I would need to pull apart the algorithms by myself. My process began to change once again.

Intimate engagement with algorithms requires the ability to code. I hired a statistics graduate student to tutor me over one summer, and began co-teaching a course in text mining with a coding component. I began to use the teams of data scientists as my guides and deep collaborators, working with them inside the code itself. I still leaned on the high-performance computing staff of the university mainframe for large-scale analyses, like word embeddings applied to an entire century of data. Increasingly, graduate students and undergraduates working with me developed the necessary expertise to oversee these processes on their own. My own capabilities expanded as well.

My approach to the work also began to develop. I theorized the task of applying critical thinking to the data, the algorithm, and the historical problem in turn in an article called ‘Critical Search.’[12] A second article, ‘The Algorithm’, applies the methods of Critical Search to a longue-durée analysis of British history in the nineteenth century.[13]

Rather than trusting technicians with the algorithms (even as my co-authors), I began to put my own energies into unpacking each algorithm, working with one process at a time. First was an article on dynamic topic modeling; then one on tf-idf (a measure of the importance of a word to one document within a set of documents); then another on word embeddings (still in process). In between, I paused to write about questions of interpretation, drawn from working with students on the question of why text mining is not automatic, and how the historian chooses which words to interpret.[14]
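
For readers who have not met tf-idf, a worked toy example may help. The version below is the plain textbook form, term frequency multiplied by the logarithm of (number of documents / number of documents containing the word); many libraries use smoothed variants, and this is not the code behind the article mentioned above. The three 'documents' are invented word lists labelled by decade.

```python
import math
from collections import Counter

# Invented token lists, one per period, for illustration only
docs = {
    "1840s": "rent rent landlord tenant famine".split(),
    "1880s": "rent landlord tenant league league".split(),
    "1900s": "rent landlord tenant purchase".split(),
}

def tf_idf(docs):
    """Unsmoothed tf-idf: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))   # count each word once per document
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

scores = tf_idf(docs)
top_1880s = max(scores["1880s"], key=scores["1880s"].get)
print(top_1880s, scores["1840s"]["rent"])
```

Note what happens to 'rent': it is the most frequent word in the 1840s list, yet it scores zero because it appears in every period. The measure rewards words that are distinctive of one period, which is why the historian still has to decide whether a high-scoring word is meaningful or merely unusual.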

I found myself asked to address larger issues of replicability. These included whether sharing the results of this research with colleagues also requires developing an interactive web browser where others can test their own version of a hypothesis (we think this is the ideal, but it is also obviously expensive).[15] I also developed a robust set of approaches for ‘validating’ historical knowledge by testing the results of algorithmic inquiry according to different means.[16] These meditations on each algorithm became the basis for my forthcoming book, The Dangerous Art of Text Mining.

Meanwhile the Archive

A savvy approach to the reality of delays when pursuing new research is to take multiple paths at once. My research project—to write a digital history of property over the longue durée—sprawled from five years to fifteen, and from one book into five, in part because of the richness of the archive and the complications of methodological fields that were maturing at their own rate.

Archival research by traditional means was, in many ways, the insurance policy behind my experimental endeavours. Even if digital techniques did not produce persuasive results, I would at least be able to write about the material I had gathered in archives. While I pursued various digital strategies, I could meanwhile point to historical publications along the way that marked my progress towards writing a long-term history of property.

Meanwhile, the years in the archives, photographing materials from the territories across the British empire, the United Nations, and postcolonial movements around the world, had produced fruit. There was enough material in the notes I’d taken in archives to support a traditional work of history. The fact that I had digital facsimiles, easily readable by human techniques, meant that I had more than enough material to dive into the details of each case study.

One of the morals of this story is that a scholar can maintain an interest in new methods while pursuing traditional research in the archives. So long as the goal is to acquire the right documents about the past, to read them, and to state the understanding found in those sources with accuracy, trips to the archive with photographic or scanning equipment can easily serve multiple aims.

The unexpected twists and turns in digital fields, and the abundance of materials in archives, ultimately forced me to conceive of not one project but five.

The Long Land War came out in 2022. Readers who knew my plans to use text mining to study property might have been surprised to see that this book had virtually no digitally-sourced data: a result of the fact that my digital practice was still evolving when my archival research was ready to go to press. A prequel to The Long Land War, narrating the nineteenth-century story, is currently in draft—its editing supported by a writing fellowship and the manuscript under contract with a trade press with the title How Not to Kill Your Landlord. Meanwhile, a methodological theory of text mining—The Dangerous Art of Text Mining—is forthcoming in 2023, and an accompanying textbook on coding for historical analysis is in preparation with the title Text Mining for Historical Analysis. A volume that marries the historical and archival research to the digital text mining is also in draft, under the title, A Distant Reading of Property. All the aspects of the project are lined up, eleven years after the global property project began. But they hardly played out as I might have expected. Instead of one book, there will be five. Instead of five years, it will have been at least fifteen.

The supervisors of doctoral theses rightly steer graduate students away from projects that are too vast, too long, or require methodological innovations that are impossible given the time ahead. My supervisors too had instilled in me a respect for the realistic limits of research.

But in 2008-09, I had the extraordinary opportunity to hazard something big but risky, and I took it. I had earned not one but two postdoctoral fellowships that made available five years for research. The fellowship at the Harvard Society of Fellows also came directly with a challenge, issued by the Society’s founder, A. Lawrence Lowell, who feared that the modern research university was encouraging specialization in knowledge that might work against certain forms of discovery. The challenge was to use the time available to work between disciplines, to accomplish some insight that could not have been ascertained from within the constraints of college teaching and research within the disciplines.

I had not one but three questions, all direct outgrowths of publications already in press. First was the problem of the longue durée. What were students of modern History missing when they looked at ten or twenty years at a time rather than a century or two? I had started wondering about this in my first book, Roads to Power (2012), which covered 1726 to 1848 (an unusual length for the time) but whose introduction and conclusion stretched from ancient Rome to the present day. Might not longer time scales bear fruit for other questions? Second was the question of common property rights over a two-hundred-year period, addressing the critical matter of to whom the land belonged, an issue of broad legal and political debate since the nineteenth century, tightly bound up with postcolonial questions and ideas of reparation in the twentieth century. The third problem was a hope that new methods might be suited to longitudinal study. My early article using word counts in Google Books, then still a new corpus, had studied how strangers moved in the street in nineteenth-century London. Could not similar techniques be applied to the study of property?

The way I dreamed then of these conversations coming together was efficient, clear, and mechanical. I would hire data scientists from the burgeoning data sciences, who’d scan and analyze treatises on property law from places colonized by Britain over two hundred years. I would read widely from historical monographs and deeply in historical sources, churning my perspective into a single-volume study of property over the longue durée.

I looked at shelves full of authors who had attempted similar feats of long-term analysis with enormous aplomb and fascinating results. E.P. Thompson, Richard Tuck, and Stuart Banner had written long-term histories of property law relatively early in their careers. R.J.P. Kain had done so with maps, albeit after a lifetime of condensed study.

Dozens of books tracing various elements of my theme indicated that the framework of decades was too short, and that the life of property was measured over centuries. Recent contributions to the history of property relating to the British empire, foregrounding land grabs from Australia and Africa to Canada, suggested that the issue was due for revisiting. A new century or two, a global literature, and a cutting-edge method that I had already engaged, applied to a literature on property that was already longue-durée in scope: it was a challenge, but an eminently feasible one.

As the foregoing sections of this article demonstrate, the challenge had twists and turns I could not have anticipated at the beginning. Many of these were wrapped up in the fact that both the digitalization of new archives, and text mining for historical analysis, needed time to mature as fields. What I could not predict was the winding path that digital research would take—or how much more I would need to write and learn before the standards of digital research matched the standards of truth expected by the historical profession.

Maintaining a focus on archival research while in the midst of ongoing methodological work meant it was ultimately possible to publish in both fields, even though the resulting books came out separately and each has a distinct disciplinary focus. Only after those original volumes were complete was it clear what I needed to do to write a digital history of property, in which methods and subject-matter could work together.

Conclusion

Today, the subfields of digital history are maturing. We have robust methods for text mining and for digitalization. The methods of text mining for historical analysis are ready to be deployed as chapters in dissertation research. They can be applied by anyone with basic coding skills or the assistance of trained coders. Digitalization now has a robust set of best practices, but it remains an expensive undertaking that requires the levels of investment typically housed at major universities.

Text mining is now reliable enough to be applied to clean historical data sets such as the texts of debates in the British Houses of Parliament and the US Congress. The tools I profile in The Dangerous Art of Text Mining (dynamic topic models, temporal distinctiveness, word embeddings, named entity recognition, and part-of-speech analysis) have been used by historians to produce new knowledge. As I describe in ‘Critical Search’, these approaches can provoke new insights from the researcher. They can be iteratively engaged to open up subaltern perspectives and questions about the long- and short-term dynamics of temporal change.

My decision to explore digital history techniques, at a time when the methods of digitalization and text mining were underdeveloped, has given me amazing opportunities to engage with new developments and contribute to research techniques I hope will be of use to future scholars. But it’s also a path full of unexpected twists. Had I been a graduate student with limited time to complete my project, the route I took might have been disastrous. Similarly, if I had not had tenure for my first book, the unexpected delays that were part of research in a new field might have caused real trouble.

One final, important lesson of this story is that there’s nothing automatic about digital history. None of the algorithms I’ve tested offers a magical tool for extracting information about the past without some other form of engagement with history, via primary and secondary sources.

What digital tools offer is support for thinking over long-term horizons, gaining confidence earlier in the process of reading, and finding surprising patterns in the archive. But in the art of text mining, as I conceive it, the researcher is always a historian who’s simultaneously reading the work of colleagues and engaging it with digital tools.

About the author

Jo Guldi is Professor of History at Southern Methodist University, Dallas, TX, and a leading practitioner and critic in the use of digital resources and methods for historical research.

Her work on the history of property includes the monographs, Roads to Power. Britain Invents the Infrastructure State (Harvard UP, 2012) and The Long Land War. The Global Struggle for Occupancy Rights (Yale UP, 2022).

With David Armitage, Jo has published The History Manifesto (Cambridge UP, 2014), a study of the value of longue-durée thinking for historical interventions in contemporary society. Details of her extensive work on digital historical practice and capability are listed in this article and also via her website.

Jo’s next book, The Dangerous Art of Text Mining. A Methodology for Digital History, is published by Cambridge University Press in summer 2023.

References and Notes

[1] Ian Milligan, The Transformation of Historical Research in the Digital Age (Cambridge: Cambridge University Press, 2022). https://www.cambridge.org/core/elements/transformation-of-historical-research-in-the-digital-age/30DFBEAA3B753370946B7A98045CFEF4

[2] I use the term ‘digitalized’ in distinction to the oft-used ‘digitized’ because the technical work of the digital historian begins in most instances with the steps of scanning and optical character recognition discussed below, which turn text into digital transcripts or digital text, where words can be counted by computer. Those transcripts are then further analyzed as digits by binary code. But the intermediary step is digital text, which is what my code works on, not digits in binary. Hence I will describe our work as ‘digitalization’ (making digital objects that can be analyzed by digital processes) as opposed to ‘digitization’, or turning into digits.

[3] Joris van Eijnatten, Toine Pieters, and Jaap Verheul, ‘Big Data for Global History: The Transformative Promise of Digital Humanities,’ BMGN – Low Countries Historical Review 128, no. 4 (December 16, 2013): 55–77, https://doi.org/10.18352/bmgn-lchr.9350.

[4] Ruth Ahnert, Emma Griffin, Mia Ridge and Giorgia Tolfo, Collaborative Historical Research in the Age of Big Data (Cambridge: Cambridge University Press, 2023).

[5] Ryan Cordell, ‘“Q i-Jtb the Raven”: Taking Dirty OCR Seriously,’ Book History 20, no. 1 (2017): 188–225.

[6] Adam Crymble, Technology and the Historian: Transformations in the Digital Age (University of Illinois Press, 2021).

[7] The three journals are Current Research in Digital History, The Journal of History, Culture and Modernity (which doesn’t have ‘digital’ in the name but has had a digital focus), and The Journal of Digital History.

[8] Luke Blaxill, The War of Words: The Language of British Elections, 1880-1914, Royal Historical Society, Studies in History. New Series 103 (Woodbridge, Suffolk [England]; Rochester, NY: The Boydell Press, 2020), https://doi.org/10.2307/j.ctvnwc01p

[9] Ruth Ahnert and Sebastian E Ahnert, ‘Metadata, Surveillance and the Tudor State,’ History Workshop Journal 87 (April 1, 2019): 27–51, https://doi.org/10.1093/hwj/dby033; Ruth Ahnert and Sebastian E. Ahnert, ‘Protestant Letter Networks in the Reign of Mary I: A Quantitative Approach,’ ELH 82, no. 1 (March 10, 2015): 1–33, https://doi.org/10.1353/elh.2015.0000.

[10] Andrew Piper, Enumerations: Data and Literary Study (Chicago: University of Chicago Press, 2019), http://chicago.universitypressscholarship.com/view/10.7208/chicago/9780226568898.001.0001/upso-9780226568614. Richard Jean So, Redlining Culture: A Data History of Racial Inequality and Postwar Fiction (Columbia University Press, 2020). Ted Underwood, Distant Horizons: Digital Evidence and Literary Change (Chicago: The University of Chicago Press, 2019). Sandeep Soni, Lauren F. Klein, and Jacob Eisenstein, ‘Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers,’ Journal of Cultural Analytics 1, no. 2 (January 18, 2021): 18841, https://doi.org/10.22148/001c.18841. Lauren F. Klein, ‘W.E.B. Du Bois’s Data Portraits: Visualizing Black America Ed. by Whitney Battle-Baptiste and Britt Rusert (Review),’ African American Review 53, no. 2 (June 16, 2020): 152–54, https://doi.org/10.1353/afa.2020.0024.

[11] Blaxill, The War of Words.

[12] Pim Huijnen, ‘Digital History and the Study of Modernity,’ International Journal for History, Culture and Modernity 7, no. 0 (October 31, 2019), https://doi.org/10.18352/hcm.591. Pasi Ihalainen and Joris van Eijnatten, ‘Ecumene Defined,’ in Nationalism and Internationalism Intertwined: A European History of Concepts Beyond the Nation State, ed. Pasi Ihalainen and Antero Holmila (New York: Berghahn, 2022). Joris van Eijnatten and Ruben Ros, ‘The Eurocentric Fallacy. A Digital-Historical Approach to the Concepts of “Modernity”, “Civilization” and “Europe” (1840–1990),’ International Journal for History, Culture and Modernity 7, no. 1 (November 2, 2019): 686–736, https://doi.org/10.18352/hcm.580. Ruben Ros, ‘Conceptualizing an Outside World: The Case of “Foreign” in Dutch Newspapers 1815–1914,’ Contributions to the History of Concepts 16, no. 2 (December 1, 2021): 27–51, https://doi.org/10.3167/choc.2021.160203.

[13] Gregory Wawro and Ira Katznelson, Time Counts: Quantitative Analysis for Historical Social Science (Princeton, N.J.: Princeton University Press, 2022), https://press.princeton.edu/books/hardcover/9780691155043/time-counts.

[14] Jo Guldi, ‘Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora,’ Journal of Cultural Analytics, 2018, 1–35, https://doi.org/10.22148/16.030.

[15] Jo Guldi, ‘The Algorithm: Mapping Long-Term Trends and Short-Term Change at Multiple Scales of Time,’ American Historical Review 127, no. 2 (June 1, 2022): 895–911, https://doi.org/10.1093/ahr/rhac160.

[16] Ashley S. Lee et al., ‘The Role of Critical Thinking in Humanities Infrastructure: The Pipeline Concept with a Study of HaToRI (Hansard Topic Relevance Identifier),’ Digital Humanities Quarterly 14, no. 3 (25 September 2020); Jo Guldi, ‘Can Algorithms Replace Historians?,’ in Historical Understanding: Past, Present, and Future, ed. Zoltán Boldizsár Simon and Lars Deile (London: Bloomsbury Publishing, 2022).

[17] Jo Guldi, ‘From Critique to Audit: A Pragmatic Response to the Climate Emergency from the Humanities and Social Sciences, and a Call to Action,’ KNOW: A Journal on the Formation of Knowledge 5, no. 2 (September 1, 2021): 169–96, https://doi.org/10.1086/716854; Jo Guldi, ‘Scholarly Infrastructure as Critical Argument: Nine Principles in a Preliminary Survey of the Bibliographic and Critical Values Expressed by Scholarly Web-Portals for Visualizing Data,’ Digital Humanities Quarterly 14, no. 3 (1 September 2020).

[18] Jo Guldi and Benjamin Williams, ‘Synthesis and Large-Scale Textual Corpora: A Nested Topic Model of Britain’s Debates over Landed Property in the Nineteenth Century,’ Current Research in Digital History 1 (2018), https://doi.org/10.31835/crdh.2018.01. Jo Guldi, ‘Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora,’ Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.030.

ALSO AVAILABLE IN THIS SERIES

Part One: ‘We are all Digital Now: and what this means for historical research’, by Ian Milligan

Part Two: ‘Tools for the Trade: and how historians can make best use of them’, by William J. Turkel

Part Three: ‘Why Archivists Digitise: and why it matters’, by Anna McNally

Part Four: ‘Researching with Big Data: and how historians can work collaboratively’, by Ruth Ahnert

Part Five: ‘Digitising History from a Global Context: and what this tells us about access and inequality’, by Gerben Zaagsma

Part Six: ‘The Trouble with Text Mining: and why some projects take a long time, and future projects might take less time’, by Jo Guldi

FURTHER RHS BLOG SERIES

The Society’s blog, Historical Transactions, offers regular think pieces on historical research projects and approaches to the past. These include several previous series, addressing wide-ranging questions concerning historical methods and the value of historical thinking.

Recent series include ‘Writing Race’ and ‘What is History For?’ We welcome proposals for other short series of posts, bringing historians together to discuss topics, practices and values. If you’d like to suggest an RHS blog series, please email: philip.carter@royalhistsoc.org.
