Our Corpora

Last edited by Alan Liu 4 years, 10 months ago


English 197 Corpora Folder on Google Drive


Works in our children's literature corpus came from Project Gutenberg's "Children's Bookshelf" category. We drew works of fiction (mostly novels, but including some short fiction) from all the subcategories on that bookshelf, constraining our selection to works published in the 1880s. Works in our adult fiction corpus came from the corpus of 2,731 nineteenth-century British novels given to us by the Stanford Literary Lab (originally gathered by the Lab from the Internet Archive and Project Gutenberg). (Thanks to Ryan Heuser of the Stanford Literary Lab.) We constrained our selection to novels of the 1880s by male and female authors.


Below are links to zip files on our course Google Drive that contain our corpora and sub-corpora.  These include the plain-text files for full works.  Other zip files in our Google Drive folder contain works that have been "cleaned" (we used the Lexos "scrubber" and Matthew Jockers's stoplist) and also cleaned-and-"chunked" (we used the Lexos chunker to break files for topic modeling into 1,000-word segments).
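For readers who want to reproduce a similar workflow without Lexos, the cleaning and chunking steps can be approximated in a few lines of Python. This is only a rough sketch under stated assumptions, not the Lexos implementation: the function names `scrub` and `chunk` are hypothetical, the tiny stoplist stands in for Matthew Jockers's much longer list, and the final chunk is simply left shorter than 1,000 words, which may differ from how the Lexos chunker handles a remainder.

```python
import re

def scrub(text, stopwords):
    """Lowercase the text, keep only alphabetic tokens (and apostrophes),
    and drop any token found in the stoplist. Hypothetical helper."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

def chunk(tokens, size=1000):
    """Split a token list into consecutive segments of `size` tokens.
    The last segment may be shorter. Hypothetical helper."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Illustrative stoplist only; a real run would load the full stoplist from a file.
sample_stopwords = {"the", "and", "a", "of"}
cleaned = scrub("The cat and the dog! The END.", sample_stopwords)
segments = chunk(cleaned, size=1000)
```

Each segment could then be joined back into a string and written to its own file, so that a topic-modeling tool treats every 1,000-word segment as a separate document.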


Since all the works in our 1880s corpora are in the public domain, we have made them available in their various original plain-text, cleaned, and cleaned-and-chunked versions here, together with spreadsheets of metadata about the works. The one exception is the full nineteenth-century British fiction corpus (and its metadata spreadsheet) originally given to us through the generosity of the Stanford Literary Lab, which we have not made publicly accessible on our Google Drive. We did not feel that the intellectual and manual labor that went into assembling that corpus was ours to pass along on our own initiative.


Adult British Fiction - 1880s (451 works) (metadata spreadsheet)

Children's Fiction - 1880s (135 works) (metadata spreadsheet)



Special thanks to class members Lindsay Blackie, Alec Killoran, and Aaron Woldhagen for assisting with the assembly of the corpora.  Thanks to the Stanford Literary Lab for sharing the larger corpus of British nineteenth-century fiction from which the course drew its subset of 1880s "Adult British Fiction" works.


