Our Topic Models

Page history last edited by Alan Liu 8 years, 10 months ago

 

Overview

Topic Models for 1880s Children's Literature Corpus

Topic Models for Our 1880s Adult Fiction Corpus

Our Process and Results

Suggestions for Future Topic Modelers


Overview

We created the following topic models using the MALLET topic modeling tool (including the parameter "--optimize-interval 20" in the command line for training models). After experimenting with different numbers of topics (starting with the 500 topics that Matthew Jockers found to be optimal for his corpus of 3,346 novels), we settled on 300-topic models for our smaller corpora, which ranged between 134 and 451 works (see note 1).  To prepare the texts for topic modeling, we scrubbed them using the online Lexos tools (loading Jockers's stoplist for "scrubbing" and preserving internal apostrophes).  We also used Lexos to chunk the novels into 1,000-word segments (see note 2).
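As a rough illustration of the training step, the sketch below assembles a MALLET "train-topics" command line in Python. It is not a record of the exact command we ran: only the 300 topics and "--optimize-interval 20" come from the description above. The file names (corpus.mallet, keys.txt, composition.txt) and the helper name build_train_command are placeholders, and the sketch assumes the "mallet" launcher is on the PATH.

```python
def build_train_command(input_file, num_topics=300, optimize_interval=20,
                        keys_file="keys.txt",
                        doc_topics_file="composition.txt"):
    """Assemble a MALLET train-topics command as a list of arguments.

    File names are hypothetical placeholders; the topic count and the
    optimize interval are the values described in the overview.
    """
    return [
        "mallet", "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        "--output-topic-keys", keys_file,
        "--output-doc-topics", doc_topics_file,
    ]


command = build_train_command("corpus.mallet")
print(" ".join(command))

# To actually train a model (requires a local MALLET installation):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list keeps each argument separate, which makes it easy to rerun the same call with a different number of topics when comparing models.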
After generating each topic model, we sorted the "keys.txt" file to rank the topics by weight.  Then in a first interpretive stage, our topic-modeling team inspected, labeled, classified, and compared what we felt were coherent, interpretable topics among the top-weighted ones, comparing topics in whole corpora against those in female-authored, male-authored, and other subcorpora.
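The ranking step can be sketched in a few lines of Python. This assumes the usual layout of a MALLET topic-keys file (topic number, weight, and top words, separated by tabs). The three sample lines are invented for illustration and do not come from our models, and the function names are hypothetical.

```python
def parse_topic_keys(lines):
    """Parse topic-keys lines into (topic_id, weight, words) tuples.

    Assumes each line is: topic id <TAB> weight <TAB> space-separated words.
    """
    topics = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def rank_by_weight(topics):
    """Return the topics sorted from heaviest to lightest weight."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


# Invented sample lines, for illustration only.
sample_lines = [
    "0\t0.05\tship sea captain",
    "1\t0.30\tmother father home",
    "2\t0.75\ttalk people suppose",
]

ranked = rank_by_weight(parse_topic_keys(sample_lines))
for topic_id, weight, words in ranked:
    print(topic_id, weight, " ".join(words))
```

To rank a real model, the same functions could be applied to the lines read from the "keys.txt" file, keeping only the first fifty results for inspection.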


Topic Models for 1880s Children's Literature Corpus

(Note: "chunked" below means that we generated our topic models from cleaned plain-text versions of the works, segmented into 1,000-word chunks.  We did this to optimize the "interpretability" of the topics. As we understand it, the underlying question is roughly: in the genre being studied, what length of text corresponds to the "attention span" of a topic?)
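For readers who want to see what chunking amounts to, here is a minimal Python sketch that splits a text into 1,000-word segments. We did our chunking in Lexos, not with this code. The function name chunk_words is hypothetical, words are split on whitespace only, and a short final chunk is simply kept as it is, which may differ from how Lexos treats the last segment.

```python
def chunk_words(text, chunk_size=1000):
    """Split a text into consecutive chunks of at most chunk_size words."""
    words = text.split()  # whitespace tokenization only
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A toy "novel" of 2,500 identical words, for illustration only.
toy_text = " ".join(["word"] * 2500)
chunks = chunk_words(toy_text)
print(len(chunks))
print([len(chunk.split()) for chunk in chunks])
```

Each chunk would then be saved as its own plain-text file so that the topic modeling tool treats it as a separate document.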

Topic Models for Our 1880s Adult Fiction Corpus

 

 



Our Process and Results--Alec Killoran
As a group, we looked at the top fifty topics from our corpus (and from branches of our corpus). We collectively determined which topics were obvious and cohesive enough for us to assign them a label.  After doing so, we identified seven broader, overarching themes that appeared across multiple branches of our corpus.  We then created a table for each of these seven themes, and we populated the tables with the most heavily weighted topics within each theme.  Finally, we drew a few basic conclusions from our tables.  As a result of this methodology, we threw out a number of heavily weighted topics because they did not fit within a particular theme. That allowed us to focus on what we felt were interpretable topics.  The result is the set of selected, focal topics represented in the document "Categorized Topics" on which our topic-modeling group collaborated.
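The table-building step can be sketched as follows. Our tables were assembled by hand, so this is only an illustration of the idea: group labeled topics by theme, then order each group by weight. The theme names, corpus names, and weights are the ones reported in the conclusions on this page; the record layout and the function name build_theme_tables are assumptions made for the example.

```python
from collections import defaultdict

# (theme, corpus, weight) records, using weights reported on this page.
labeled_topics = [
    ("Happiness and Beauty", "Children's Literature British Female", 0.9234),
    ("Happiness and Beauty", "Children's Literature British Female", 0.554),
    ("Happiness and Beauty", "1880s British Female", 0.6205),
    ("Conversations and Speech", "1880s British Female", 0.6359),
    ("Conversations and Speech", "1880s British Male", 0.6044),
    ("Conversations and Speech", "1880s British Male", 0.5458),
]


def build_theme_tables(records):
    """Group topics by theme and sort each group from heaviest to lightest."""
    tables = defaultdict(list)
    for theme, corpus, weight in records:
        tables[theme].append((corpus, weight))
    for rows in tables.values():
        rows.sort(key=lambda row: row[1], reverse=True)
    return dict(tables)


tables = build_theme_tables(labeled_topics)
for theme, rows in tables.items():
    print(theme)
    for corpus, weight in rows:
        print(f"  {weight:.4f}  {corpus}")
```

Topics that received no theme label would simply be left out of the input list, which mirrors our decision to set aside heavily weighted topics that did not fit a theme.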

Topic Selection: Lindsay Blackie, Maithy Do, Ashley Jeun, Alec Killoran, Aaron Woldhagen
Table Creation and Formatting: Aaron Woldhagen
Conclusions: Tables 1-3: Alec Killoran
Conclusions: Tables 4-5: Maithy Do
Conclusions: Tables 6-7: Ashley Jeun

Topics regarding death were weighted heavily in the adult British fiction corpora, and only one significantly weighted topic from the children’s literature corpus made an appearance in this area.  Additionally, the table shows that novels by female authors produced several varied topics relating to death, while novels by male authors produced a single, extremely heavily weighted topic.  Some assumptions are confirmed in this set.  Female authors produced a topic regarding the death of children (#5 in the table), while the children’s literature authors produced a topic with concrete descriptions of death.
Courtship and chivalry topics were found exclusively in the adult fiction corpora.  As a theme, courtship figured heavily in both female-authored and male-authored works.  The second topic in the table deals with betrothal, and the transactional words “offer,” “give,” and “consent” betray its male-centric underpinnings.  Conversely, the topics produced by female authors largely evoke images of courtship that go beyond simply arranged marriages.  The concept of mutual love is present in these topics.  The third topic in the table bridges the gap between the misogyny of the second topic and the social optimism of the last two topics.  The topic paints women in a fairer light, but simultaneously objectifies them.  Very little is offered in any of these topics about the actual relationships between suitors.
Children’s literature produced a number of topics relating to war and violence.  It is safe to say that the war and adventure genre saw significant play in children’s literature of the 19th century.  Of particular note in these topics is the absence of the macabre.  Though the topics deal with armies, war, and danger, there were no topics dealing with their consequences.  It is an important discrepancy, since depictions of death in other corpora do not shy away from its tragic aspect.  The obvious conclusion here is that children’s literature authors did not include such macabre descriptions of war in their novels.  At issue was not the human cost of war, but rather the larger, perhaps more glorious, and more sweeping elements of armed conflict.
Families usually involved a mother, a father, a daughter, and a son at war. It appears that there were many topics of sadness and dread, perhaps in relation to the son going off to war, as well as happiness when the son returned home. Daughters appear more light-hearted, filled with laughter and affection.  
There are several negative words associated with work and the everyday, such as “didn’t,” “wouldn’t,” and “couldn’t.” Time and money are key issues, but it appears that work is only done during the day. At night, there are laughter, dinner parties, and social circles.
For the theme "Happiness and Beauty," the most heavily weighted topic came from the Children's Literature British Female corpus, with a weight of 0.9234. Its terms included “life, day, time, people, family, days, age,” which characterize the topic as concerned with daily life and with personal matters such as family and people. The second highest was also from the Children’s Literature British Female corpus, with a weight of 0.554. Its terms included “friend, business, times, work, pleasure, worst, afraid, silence, courage, companion,” indicative of business transactions and the emotions involved in them, as well as pleasure or disapproval in work. The third highest was from the 1880s British Female corpus, with a weight of 0.6205. Its terms included “mind, felt, heart, sense, kind, laugh, question, smile, eyes”; these terms also relate to physical features and emotional experience. Overall, topics in the “Happiness and Beauty” theme appeared more often in the female-authored corpora than in the male-authored or combined ones, suggesting that women authors in both the children’s literature and the 1880s British corpora wrote about such topics more than male authors did.
For the theme “Conversations and Speech,” the most heavily weighted topic came from the 1880s British Female corpus, with a weight of 0.6359. Its terms included “people, understand, talk, suppose, knew, speak, word, idea, find, heart, talking.” These words relate to conversational and dictated speech between two or more individuals. The second highest was from the 1880s British Male corpus, with a weight of 0.6044, and included terms such as “people, talk, deal, call, suppose, men, pleasant, conversation, nice, talking, fact, laughing.” The third highest was also from the 1880s British Male corpus, with a weight of 0.5458. Some of its words were “question, matter, present, people, opinion, subject, facts, interest, view, difficulty, idea, order, decided.” Looking at the two male-authored topics, it appears that among male authors the most heavily weighted topics deal with opinionated decisions, personal views, and social calls. Authors of both genders, in both the 1880s and the children’s literature corpora, are actively involved in the theme of “Conversations and Speech.”


Suggestions for Future Topic Modelers--Alec Killoran

With additional time, narrower themes could be identified, or individual topics could be analyzed.  Given the nature of topic modeling, though, stronger conclusions arise from thematically grouped topics.  As a group, we would suggest that any team doing topic modeling research at least attempt to develop larger themes with which to group topics, especially across different corpora.  Additionally, though our limited time did not allow for it, our class initially wanted to compare corpora across different decades.  Thematic grouping within a particular genre might be an ideal way to examine the changes in that genre over time.  We know of little or no precedent for our methodology of thematic grouping, and our group strongly recommends it to topic modelers both as a base from which to begin analyses and as an enhancement to already established analyses.

Notes


1. For Matthew L. Jockers's discussion of issues involved in setting the number of topics for a topic model, see his Macroanalysis: Digital Methods and Literary History (Champaign, IL: U. Illinois Press, 2013), 128.

2. For Jockers's discussion of his conclusion that 500-1,000 word chunks are optimal, see his blog post "'Secret' Recipe for Topic Modeling Themes" (Matthew L. Jockers, 12 April 2013).

Topic modeling team: Lindsay Blackie, Maithy Do, Ashley Jeun, Alec Killoran, and Aaron Woldhagen.

 

 

 
