nyrbflat

You can click on the above image to zoom in!

Background

Around the time I put together the New York Review of Books project, I used the Wayback Machine to collect approximately twenty years of personals ads from the NYRB.[1]

I initially used the files to learn how to train GPT-2 to replicate the style of a model text. However, I found the end result somewhat unsatisfying, in the sense that a lot of NYRB personals already read as though they might have been generated by a computer. The slightly sloppier computer-generated version wasn’t all that interesting.

So what I recently did instead was use Andreas Mueller’s WordCloud module to generate, well, wordclouds or tag clouds from the personals listings (after scrubbing emails and addresses). It is an imperfect representation of a text, but I think that it does a good job of highlighting the frequency of certain terms and phrases.
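As a rough illustration of the scrubbing step, here is a minimal sketch that strips email addresses with a regular expression. The pattern, the scrub function, and the sample listing are all assumptions made for illustration, not the cleanup actually used for these files; postal addresses would need their own patterns.

```python
import re

# Hypothetical pattern: the post does not say how the scrubbing was done.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(listing: str) -> str:
    """Remove email addresses from one listing and tidy the whitespace."""
    cleaned = EMAIL_RE.sub(" ", listing)
    # Collapse the gaps left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()


# Made-up listing, only to show the effect.
sample = "Witty professor seeks hiking partner. Write to reader42@example.com today."
print(scrub(sample))
```

Running the scrub before counting keeps fragments of addresses from showing up as high-frequency "words" in the cloud.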

What’s neat about the wordcloud module itself is that it gives you a fair amount of control of the end product’s substance without processing the words yourself first. You can tune the extent to which it highlights bigrams (two-word pairings) as part of its frequency analysis, set stop-words, determine whether to include numbers, and implement word length thresholds.
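The controls mentioned above map onto keyword arguments of the WordCloud class. The sketch below collects them in one place; the specific values, the extra stop-words, and the helper names text_options and count_terms are illustrative assumptions rather than the settings used for the images in this post.

```python
# Extra stop-words to drop on top of the module's defaults (hypothetical choices).
EXTRA_STOPWORDS = {"nyr", "box"}


def text_options(threshold: int = 30) -> dict:
    """Keyword arguments for wordcloud.WordCloud that affect which terms are counted."""
    return {
        "collocations": True,                # count two-word pairings as well as single words
        "collocation_threshold": threshold,  # lower values let more bigrams through
        "stopwords": EXTRA_STOPWORDS,        # words to ignore entirely
        "include_numbers": False,            # skip purely numeric tokens
        "min_word_length": 3,                # ignore words shorter than this
    }


def count_terms(text: str, threshold: int = 30) -> dict:
    """Return the term frequencies WordCloud would chart, without drawing anything."""
    # Imported here so the options above can be inspected without the package installed.
    from wordcloud import STOPWORDS, WordCloud

    options = text_options(threshold)
    # Passing stopwords replaces the default list, so merge the two explicitly.
    options["stopwords"] = STOPWORDS | options["stopwords"]
    return WordCloud(**options).process_text(text)
```

Calling count_terms on the scrubbed listings with, say, threshold=12 returns a dictionary of words and bigrams with their counts, which is a quick way to see what a setting does before rendering an image.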

The wordcloud module also gives you a fair amount of control over how your cloud looks (fonts, colors, verticality of text). What I think is especially neat is that you can use a picture as a mask to which the wordcloud conforms.
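Here is a minimal sketch of the masking and appearance options. Instead of loading a picture, it builds a simple elliptical mask with NumPy; in a mask, pure white (255) cells are treated as off-limits and everything else is free to draw on. The ellipse, the color choices, and the output path are assumptions for illustration.

```python
import numpy as np


def ellipse_mask(width: int = 800, height: int = 400) -> np.ndarray:
    """Return a uint8 array: 0 inside an ellipse (drawable), 255 outside (masked out)."""
    y, x = np.ogrid[:height, :width]
    cx, cy = (width - 1) / 2, (height - 1) / 2
    inside = ((x - cx) / (width / 2)) ** 2 + ((y - cy) / (height / 2)) ** 2 <= 1
    return np.where(inside, 0, 255).astype(np.uint8)


def render(text: str, out_path: str = "cloud.svg") -> None:
    """Draw a masked cloud and save it as a vector file."""
    # Imported here so the mask helper can be used without the package installed.
    from wordcloud import WordCloud

    cloud = WordCloud(
        mask=ellipse_mask(),       # words are only placed on non-white cells
        background_color="white",
        colormap="viridis",        # any matplotlib colormap name
        prefer_horizontal=0.9,     # share of words the layout tries to place horizontally
    )
    cloud.generate(text)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(cloud.to_svg())
```

To use a picture instead, the usual approach is to load it into an array (for example with Pillow and numpy.array) and pass that as the mask.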

What I wish were possible within the module is control over which bigrams and parts of speech are charted. That would be fairly straightforward to do by processing the text first with something like NLTK.
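One way to get that control is to compute the frequencies yourself and hand them to the module through generate_from_frequencies. The sketch below is a standard-library stand-in for the NLTK approach: it keeps only bigrams that start with a chosen word and leaves out part-of-speech tagging altogether. The function names and the keep_first filter are hypothetical.

```python
import re
from collections import Counter


def bigram_frequencies(text: str, keep_first=frozenset({"seeks"})) -> Counter:
    """Count bigrams whose first word is in keep_first (an illustrative filter)."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = zip(words, words[1:])
    return Counter(f"{first} {second}" for first, second in pairs if first in keep_first)


def render_selected(text: str):
    """Chart only the selected bigrams, bypassing the module's own text processing."""
    # Imported here so the counting step can be used without the package installed.
    from wordcloud import WordCloud

    frequencies = bigram_frequencies(text)
    # generate_from_frequencies expects at least one entry.
    return WordCloud().generate_from_frequencies(frequencies)
```

Swapping the simple filter for a check against NLTK's part-of-speech tags would give the finer control described above, at the cost of an extra dependency.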

Clouds

I wanted to briefly highlight what the collocation_threshold parameter does, since I think that the module’s documentation doesn’t do the best job of explaining it. All of the images are vector files, so you should be able to zoom in to see even the smallest individual words. (The image at the top of this post shows the masking feature, which is neat but mostly cuts down on how much usable space you have.)

Collocation Threshold

This parameter uses something called the Dunning likelihood collocation score to determine whether bigrams should be charted: a bigram is only counted if its score is above the threshold. More or less, the lower the number is, the more bigrams you see. What is most striking is that the term “seeks” becomes subsumed in two-word phrases that are less common than the word itself.
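For a sense of what the score measures, here is a minimal sketch of Dunning's log-likelihood ratio for a bigram, following the standard textbook formulation. It is meant as an illustration of the idea and may not match the module's internals line for line; the function names and the example counts are made up.

```python
from math import log


def _log_likelihood(k: int, n: int, p: float) -> float:
    """Log-likelihood of k successes in n trials with success probability p."""
    # Clamp p away from 0 and 1 so the logarithms stay finite.
    return k * log(max(p, 1e-10)) + (n - k) * log(max(1 - p, 1e-10))


def dunning_score(count_bigram: int, count1: int, count2: int, n_words: int) -> float:
    """Score how much more often word2 follows word1 than chance would predict."""
    if n_words <= count1 or n_words <= count2:
        return 0.0  # degenerate text: one word makes up everything
    p = count2 / n_words                                    # word2's rate overall
    p1 = count_bigram / count1                              # word2's rate right after word1
    p2 = (count2 - count_bigram) / (n_words - count1)       # word2's rate elsewhere
    independent = (_log_likelihood(count_bigram, count1, p)
                   + _log_likelihood(count2 - count_bigram, n_words - count1, p))
    dependent = (_log_likelihood(count_bigram, count1, p1)
                 + _log_likelihood(count2 - count_bigram, n_words - count1, p2))
    return -2 * (independent - dependent)


# Made-up counts: two words that each appear 100 times in a 1,000-word text.
by_chance = dunning_score(10, 100, 100, 1000)   # pair occurs about as often as chance predicts
strong_pair = dunning_score(90, 100, 100, 1000)  # pair accounts for most uses of both words
for threshold in (50, 12, 6):
    print(threshold, by_chance > threshold, strong_pair > threshold)
```

A pair that co-occurs only as often as chance predicts scores close to zero and is filtered out at any of the thresholds shown here, while a pair that almost always appears together scores far above them. Lowering the threshold mostly lets in the pairs between those two extremes.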

Here’s 50:

/assets/nyrbpersonals50bigram.svg

Here’s 12:

/assets/nyrbpersonals12bigram.svg

Here’s 6:

/assets/nyrbpersonals6bigram.svg

  1. Using Waybackpack to save files locally and then BeautifulSoup to extract the listings.