I think The Programming Historian sums up my wariness of data mining best:
Sure, these are powerful tools, if one takes the time to get to know them and can use them without permanently damaging the data itself. Making multiple backup copies of the data is the beginning of a solution, but I know from working with different versions of CSS and HTML for my own webpage that even when files are properly named and organized, it is very easy to lose track of which version is the most recent, leading to duplicated work and inconvenience.
It took me several tries to get my own data loaded into Voyant. I began by trying something from Google Books, Willard Glazier's The Capture, Prison Pen, and Escape. With the PDF, Voyant would only read the first page of the file, which is always the Google Books information sheet. The EPUB did slightly better, but I could not load individual pages, and all of the "most common words" were shown as individual letters, often with accent marks and other abnormalities. Because I could not see the individual pages, it is hard to say for sure that Voyant had translated the text into gibberish, but I suspect that was the case.
Finally, I went to good old archive.org, because I know that they provide books in plain text. I found Alonzo Cooper’s In and Out of Rebel Prisons and got Voyant to at least show me the word cloud and list of most frequently used words–in English, no less! The internet both in my office on campus and at my apartment is apparently too slow to load some pages, although a few have loaded in the time it has taken me to write this post.
Clearly, the text needs a great deal of help. I'm haunted by a line from an article by Ted Underwood, in which he states that he had to write 50,000 individual rules to correct OCR from 18th- and 19th-century texts. While his material comes from a slightly earlier period, and my Civil War documents don't have the long-s problem (the ſ that OCR misreads as an f), correcting this OCR would still require thousands of rules.
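I don't have Underwood's actual rules in front of me, but a toy sketch in Python gives the flavor of what rule-based OCR correction looks like. The misreadings and corrections below are invented examples of my own, not his:

```python
import re

# A hypothetical handful of correction rules, standing in for the tens of
# thousands a real project needs. Each pair maps a common OCR misreading to
# its correction; the long-s problem surfaces as "f" where "s" belongs.
# Real rules must be far more careful, e.g. not mangling legitimate "fame".
OCR_RULES = [
    (r"\bbeft\b", "best"),      # "beſt" misread as "beft"
    (r"\bprifon\b", "prison"),  # "priſon" misread as "prifon"
    (r"\bfame\b", "same"),      # "ſame" misread as "fame"
]

def correct_ocr(text):
    """Apply each substitution rule in order to the raw OCR text."""
    for pattern, replacement in OCR_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(correct_ocr("the beft prifon narrative"))  # -> the best prison narrative
```

Each rule is trivial on its own; the labor is in discovering and testing thousands of them against a real corpus.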
I'm still having trouble conceptualizing how text mining could help me in my own research. The applications I've thought of so far are things I don't necessarily need text mining to do. For example, if I want to find "Unionist" in close proximity to "Transylvania County," there are tools already built into Google that can do that.
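For what it's worth, even that proximity search doesn't require a dedicated tool. A rough Python sketch shows the idea; the sample sentence and the word-distance window here are invented stand-ins, not drawn from any real source:

```python
import re

def proximity_hits(text, term, phrase, window=10):
    """Find word positions where `term` falls within `window` words of
    `phrase`. The multi-word phrase is collapsed to a single placeholder
    token first, so simple word-index arithmetic works."""
    marker = "__PHRASE__"
    collapsed = re.sub(re.escape(phrase), marker, text, flags=re.IGNORECASE)
    words = re.findall(r"\S+", collapsed)
    term_idx = [i for i, w in enumerate(words) if term.lower() in w.lower()]
    phrase_idx = [i for i, w in enumerate(words) if marker in w]
    return [(t, p) for t in term_idx for p in phrase_idx
            if abs(t - p) <= window]

# Hypothetical sample text, purely for illustration.
sample = ("Every Unionist in Transylvania County knew the mountain passes; "
          "the county's loyalty was an open secret.")
print(proximity_hits(sample, "Unionist", "Transylvania County", window=5))
# -> [(1, 3)]
```

An empty result list means the two terms never appear within the window, which is exactly the relevance test I'd want to run.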
After a few more minutes of playing around and mapping frequencies, I did come up with one very basic thing that Voyant might be able to do. This is a screenshot showing the frequency of the word “prison” in the manuscript. While rudimentary, it does show that there is a significant drop in the use of the word starting around a third of the way through and continuing past the halfway point of the text. By clicking on the lowest point and skimming the text, I could see that this section of the text described Cooper’s escape from a Confederate prison in Columbia, SC. I knew that–that’s why I’m interested in the book–but if I were working with an unfamiliar text and wanted to know if it could be useful in my research, then a similar technique might allow me to rule out books that have little or no relevance.
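As I understand it, a trend graph like Voyant's is, underneath, just term counts over equal slices of the text. A minimal Python approximation of that idea (the segment count and the sample text below are arbitrary choices of mine, not Voyant's internals):

```python
import re

def term_trend(text, term, segments=10):
    """Count occurrences of `term` in each of `segments` equal slices of
    the text, roughly what a relative-frequency trend line plots."""
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, len(words) // segments)
    counts = []
    for i in range(segments):
        # The last segment absorbs any leftover words.
        chunk = words[i * size:(i + 1) * size] if i < segments - 1 else words[i * size:]
        counts.append(chunk.count(term.lower()))
    return counts

# Tiny synthetic text: "prison" clustered in the first half only.
text = "prison camp " * 5 + "escape road " * 5
print(term_trend(text, "prison", segments=2))  # -> [5, 0]
```

A dip in the counts for one segment is the signal to skim that stretch of the book, which is exactly how I found Cooper's escape chapter.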