Data Sources

Textual data tips

Is it machine-readable?

To analyze a text or group of texts digitally, you'll need digitized, machine-readable versions of the texts. A quick test: open the file and try to select (highlight) and copy the text. If you can, it's machine-readable. Some file formats are machine-readable by default: HTML, .txt, XML, and .docx. PDF files may or may not be machine-readable. To find out, try the same highlight-and-copy test on the PDF. If it works, the PDF contains a text layer, either because it was created digitally or because optical character recognition (OCR) software has been run on it. If it doesn't, you can use Adobe Acrobat Pro DC, which is installed on all campus computers, to run OCR on the file. Older versions of Adobe Acrobat Pro also have this functionality.
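When there are many PDFs to check, the highlight-and-copy test can be approximated in a script. The following is a minimal Python sketch, not a definitive method. It assumes the third-party pypdf library is installed; the function names and the 20-character threshold are illustrative choices, not part of any tool mentioned above.

```python
def looks_machine_readable(page_texts, min_chars=20):
    """Return True if the extracted page texts contain a meaningful amount of text.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def pdf_page_texts(path):
    """Extract the text layer of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party library, imported only when needed

    reader = PdfReader(path)
    # Pages without a text layer (e.g. un-OCRed scans) yield empty text
    return [page.extract_text() or "" for page in reader.pages]
```

For example, looks_machine_readable(pdf_page_texts("scan.pdf")) would report whether the (hypothetical) file scan.pdf appears to have a text layer. A result of False suggests the file still needs OCR.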

Will it work with my analysis software?

Check the software's help documentation to find out what file types it's capable of ingesting.

Textual data sources - MHC digital collections

Early English Books Online (EEBO)
Searchable collection of over 130,000 titles (books, pamphlets, periodicals, tracts, etc.) on all topics published in the British Isles and dependencies between 1473 and 1700. All of these are available as page images (not OCRed), but 25,277 are also available as keyed full text. As Text Creation Partnership (TCP) members, Mount Holyoke users can also access the TCP collection via the University of Michigan interface. EEBO includes works in over thirty languages (not just English).

HathiTrust Digital Library
HathiTrust is free to the public, but if you log in through Mount Holyoke you will have greater access to books and tools. Books in the public domain are downloadable in PDF format (usually OCRed).

ProQuest News
News database. Most newer articles are machine-readable and downloadable in a variety of formats (HTML, RTF, PDF, text only). Multiple articles can be downloaded at once in one combined file. Older news stories may only be available in un-OCRed PDF format, and are therefore not machine-readable.

JSTOR
Database of scholarly journals in all subject areas. Articles for which full text is available (usually not the most recent few years of any given journal title) may be downloaded individually in OCRed PDF format. JSTOR's new Data for Research (DFR) service also offers a way to request large quantities of documents, among other data. See the "How to get larger data sets" section of their About Data for Research page.

Project MUSE
Database of scholarly journals in the humanities and social sciences. Articles may be downloaded individually in HTML or OCRed PDF format (or both).

Women Writers Online
Database of early women’s writing in English, published by the Women Writers Project at Northeastern University. It includes full transcriptions of texts published between 1526 and 1850.

Textual data sources - free

Directory of Open Access Books
Searchable collection of Open Access peer-reviewed books from academic publishers. Most books are available for download in PDF format (OCRed).

The Internet Archive Texts
Online collection of over 6,000,000 fully accessible public domain eBooks. PDF and Full Text (.txt) versions available, but quality of OCR (optical character recognition) varies as these have not been proofread. Further proofreading/processing prior to use in digital text analysis may be required.
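As an illustration of the kind of processing that unproofread OCR text may need, here is a minimal Python sketch using only the standard library. The function name and the specific cleanup rules are assumptions chosen for illustration. Real OCR output often needs more, such as correcting misrecognized characters.

```python
import re


def clean_ocr_text(text):
    """Apply a few simple, illustrative cleanup rules to OCR output."""
    # Rejoin words hyphenated across a line break: "recog-\nnition" -> "recognition"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single line breaks inside a paragraph with spaces; keep blank lines
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Note that the first rule also removes the hyphen from genuinely hyphenated words that happen to break at the end of a line, so results should be spot-checked before analysis.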

Oxford Text Archive
A repository of digital literary and linguistic texts and corpora for research and teaching in XML, HTML, ePub, mobi (Kindle), and plain text formats. Check the Availability information for each text or corpus to determine if it's freely available to use or if there are any use restrictions. Some items may require you to ask permission first.

Project Gutenberg
A collection of over 46,000 ebooks in the public domain. Plain Text UTF-8 (.txt) and HTML versions available. All books have been proofread with the help of volunteers.
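Plain text files from Project Gutenberg typically wrap the book in license and header text, which is usually worth removing before analysis. The following Python sketch assumes the "*** START OF THE PROJECT GUTENBERG EBOOK ..." and "*** END OF THE PROJECT GUTENBERG EBOOK ..." marker lines commonly seen in these files. The exact wording can vary between files, so treat the patterns and the function name as illustrative assumptions.

```python
import re

# Assumed marker lines; the wording may differ in some files
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0       # keep from the top if no start marker
    finish = end.start() if end else len(text)  # keep to the bottom if no end marker
    return text[begin:finish].strip()
```

If neither marker is found, the function returns the whole text unchanged apart from trimmed whitespace, so it is safe to run on files from other sources.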