My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” describes a simple exploration into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how to do it yourself, provided you have access to a Mac or Unix machine.
There are three main components to doing what I did for the blog post:
- Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
- Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
- Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code
I’ll assume you can use the Unix command line at a basic level, and I’ve packaged the topic modeling and language modeling code in the GitHub repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, in giving it the name “maul” I don’t mean to suggest that it is important or anything; it is just a name I gave the repository, which is simply a wrapper around other people’s code.)
After you do that, you’ll have a directory “maul” where you unpacked the file. Go to ‘maul/data/sxsw’ and list the directory contents.
$ cd ~/Desktop/maul/data/sxsw
$ ls
example1k_statuses.txt sxsw2015_proposal_titles.csv
We are going to use these files to generate many other files to support the exploration of the titles.
To begin, we need to get the raw data, which you just listed in ‘maul/data/sxsw’, into shape. Let’s look at the format of the proposal title file first.
$ head -5 sxsw2015_proposal_titles.csv
Book Reading,InteractiveIdea,All Edge: Inside the New Workplace Networks
Book Reading,InteractiveIdea,Brands Win National Championships, Not Defenses
Book Reading,InteractiveIdea,Engaged Journalism: Connecting With News Audiences
Book Reading,InteractiveIdea,Get Big Things Done with Connectional Intelligence
Book Reading,InteractiveIdea,Moneyball for Marketing - Using Big Data to Win
The contents are given in three columns: the first gives the kind of presentation, the second the part of SXSW it is for (interactive, edu, music, or film), and the third the title of the proposal. Seeing how many proposals there are in each area is easy with Unix tools (for more details about Unix pipelines, see my blog post “Unix pipelines for basic spelling error detection” and my Unix pipeline slides).
$ cut -f2 -d, sxsw2015_proposal_titles.csv | sort | uniq -c | sort -nr
1892 InteractiveIdea
678 EduIdea
170 MusicIdea
125 FilmIdea
Let’s pull out just the titles so we can mess around with them easily. We can’t use standard CSV processing because there are commas in the titles and the titles aren’t quoted. This could be fixed, but hey, that’s often how the data shows up. Fortunately, the first two columns are guaranteed not to have commas, so we can just give ‘cut’ an open-ended field range. Also, because some titles were submitted to multiple areas, we need to sort and uniq the result to get the unique set of 2847 titles.
$ cut -f3- -d, sxsw2015_proposal_titles.csv | sort | uniq > titles.txt
Here are the first five titles.
$ head -5 titles.txt
"Ama-Zonas": music as an agent of social change
"At-Risk to Succeed" - Student Panel on SEL Course
"BYOC" With A Pocket Full Of WebSockets
"CAT"astrophe: Good, Bad and Ugly of Internet Cats
"Dumb" App Design: Yo And The Magic Of Minimalism
We don’t have a lot of data, and we are willing to be pretty crude, so we want variations like CAT, Cat, and cat to all count as the same word. To do this, we lowercase all the characters in the titles and change everything that isn’t a lowercase letter to a space (keeping the newlines that separate titles).
$ tr 'A-Z' 'a-z' < titles.txt | tr -cs 'a-z\n' ' ' > simplified_titles.txt
Here’s what the titles now look like.
$ head -5 simplified_titles.txt
ama zonas music as an agent of social change
at risk to succeed student panel on sel course
byoc with a pocket full of websockets
cat astrophe good bad and ugly of internet cats
dumb app design yo and the magic of minimalism
I did say I was going to be crude here… a good exercise would be to redo everything in this post, but using Twokenize (links to Java and Python versions are on the TweetNLP site) to tokenize and simplify the titles instead.
Computing a topic model is straightforward.
$ ../../maul.sh mallet-lda --num-iterations 1000 -n 10 simplified_titles.txt
If you want to see what happens when you ask for fewer or more topics, change the -n option, e.g. -n 5 for five topics. (With such a small dataset, you probably won’t get much useful beyond twenty topics.)
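For example, here is the same run asking for five topics; nothing else changes from the command above.
$ ../../maul.sh mallet-lda --num-iterations 1000 -n 5 simplified_titles.txt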
To get the unigram counts, we simply put one word on each line, then sort, uniq, and count them.
$ tr -cs 'a-z' '\n' < simplified_titles.txt | sort | uniq -c | sort -nr | head -30
For counting bigrams, we need to make sure to distinguish one title from another so that we don’t create a bigram from the last word of one title and the first word of the next. This is easy to do: we just change all newlines into the string EOS (for End Of Sentence). We also strip out a straggling first empty line using “sed 1d”.
$ perl -pe "s/\n/ EOS /g" simplified_titles.txt | tr -cs 'A-Za-z' '\n' | sed 1d > title_words1.txt
The file title_words1.txt has one word per line, so to get bigrams, we can create a new file that doesn’t have the first word but contains all the rest, and then paste the two together to get the bigrams.
$ sed 1d title_words1.txt > title_words2.txt
$ paste -d " " title_words1.txt title_words2.txt | sort | uniq -c | sort -nr | grep -v EOS | head -30
To get the trigrams, we just repeat the above recipe, this time stripping off the first word from title_words2.txt and pasting all three files together.
$ sed 1d title_words2.txt > title_words3.txt
$ paste -d " " title_words1.txt title_words2.txt title_words3.txt | sort | uniq -c | sort -nr | grep -v EOS | head -30
We can add a grep and awk command to the above pipeline to get only the trigrams that contain “world” and have a count above one.
$ paste -d " " title_words1.txt title_words2.txt title_words3.txt | sort | uniq -c | sort -nr | grep -v EOS | grep world | awk 'int($1)>1'
Finally, we can do it one more time to get the quadrigrams with count greater than two.
$ sed 1d title_words3.txt > title_words4.txt
$ paste -d " " title_words1.txt title_words2.txt title_words3.txt title_words4.txt | sort | uniq -c | sort -nr | grep -v EOS | awk 'int($1)>2'
Some of you may be asking at this point: why not just write a program to do all this? That’s a fine option, but often a bit of command-line hacking goes a long way and allows you to quickly test out some ideas and play with data much more rapidly than if you write and run a program. It also allows you to write simpler programs that you use as part of a Unix pipeline. If you do it enough, you might find yourself looking at it as Unix poetry. It also allows me to put those commands above without having to point you to an introduction to Python, Scala, or another programming language.
Next up: building and using an n-gram language model. I provide a small example set of 1000 tweets for testing out the commands, but to get anything interesting, you’ll need to get a larger set of your own by pulling from the spritzer and extracting the tweet text. You can see how to do this in my previous tutorials on getting data from Twitter using the streaming API or using Scala (and there are plenty of others on the internet for other languages).
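If you save the streaming output with one JSON status per line, one option for pulling out just the tweet text is the jq tool; the file name raw_statuses.json below is just a placeholder for whatever you collected, and tweets with embedded newlines may need a bit of extra cleanup.
$ jq -r '.text' raw_statuses.json > my_statuses.txt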
Here are the first five tweets in the data (mildly anonymized so that the mentioned users aren’t explicitly named in this post).
$ head -5 example1k_statuses.txt
I saw a $20,000.00 bottle of wine tonight. The end.
RT @USERNAME: Harry Potter and the Chamber of I Feel Like I Can Tell You Anything #RomanticActionMovies @midnight
Read and take note.... http://foo.bar
People who don't have a good taste in movies irritate me
@USERNAME haha no worries Luke! It was great having you guys there. P.S I missed my bus and ended up partying at Goodgod til sunrise
We first need to prepare the raw tweets so that they are good training material for analyzing the titles. We both simplify and de-twitterize them with the following ugly, monstrous command, which lowercases everything, removes @-mentions, links, and the string “rt”, replaces everything that is not a lowercase letter or a newline with a space, and finally cleans up whitespace at the start and end of every line and drops empty lines.
$ cat example1k_statuses.txt | tr 'A-Z' 'a-z' | perl -pe "s/@\w+//g" | perl -pe "s/http[^\s]+//g" | perl -pe "s/rt//g" | tr -cs 'a-z\n' ' ' | sed 's/^[ \t]*//;s/[ \t]*$//' | sed '/^$/d' > simplified_statuses.txt
It isn’t pretty, but I’m including it to show how one can just bash away at the problem using various tools. I basically did it by adding one thing at a time until the output looked like what I wanted, which again speaks to a strength of the Unix command line: it allows you to see the output as you build up your pipeline and there are a huge number of tools at your disposal for getting what you want. Here are the first five tweets now.
$ head -5 simplified_statuses.txt
i saw a bottle of wine tonight the end
harry potter and the chamber of i feel like i can tell you anything romanticactionmovies
read and take note
people who don t have a good taste in movies irritate me
haha no worries luke it was great having you guys there p s i missed my bus and ended up paying at goodgod til sunrise
Hold that to the side for a moment. To get the n-gram counts for this corpus, you can use the same strategy as before for the titles, but cleaning up the tweets as above.
$ cat example1k_statuses.txt | tr 'A-Z' 'a-z' | perl -pe "s/\n/ EOS /g" | perl -pe "s/@\w+//g" | perl -pe "s/http[^\s]+//g" | perl -pe "s/rt//g" | tr -cs 'A-Za-z' '\n' > status_words1.txt
$ sed 1d status_words1.txt > status_words2.txt
$ sed 1d status_words2.txt > status_words3.txt
$ sed 1d status_words3.txt > status_words4.txt
$ paste -d " " status_words1.txt status_words2.txt > bigrams.txt
$ paste -d " " status_words1.txt status_words2.txt status_words3.txt > trigrams.txt
$ paste -d " " status_words1.txt status_words2.txt status_words3.txt status_words4.txt > quadrigrams.txt
$ grep -v EOS quadrigrams.txt | sort | uniq -c | sort -nr > top_quadrigrams.txt
You can now look at the file top_quadrigrams.txt to see which are the most frequent quadrigrams, or grep for quadrigrams containing specific words. And, you can do the same basic process to get bigrams and trigrams, of course.
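For example, using the bigrams.txt and trigrams.txt files created above, the same counting recipe gives the top bigrams and trigrams, and a grep over top_quadrigrams.txt pulls out counted quadrigrams containing a particular word (“world” here is just an example):
$ grep -v EOS bigrams.txt | sort | uniq -c | sort -nr | head -30
$ grep -v EOS trigrams.txt | sort | uniq -c | sort -nr | head -30
$ grep world top_quadrigrams.txt | head -10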
Going back to the simplified tweets, we have what we need to train an n-gram language model. First, we train the model.
$ ../../maul.sh run edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 3 lm_twitter.arpa simplified_statuses.txt
You’ll see some output as the model is trained. We next need to convert the ARPA formatted version of the model into binary format.
$ ../../maul.sh run edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm_twitter.arpa lm_twitter.bin
Finally, we apply the trained model to the (simplified) SXSW proposal titles and then recombine the per-title average log probabilities with the original titles so that we can explore them.
$ ../../maul.sh run maul.lm.OutputAverageLogProbabilityPerLine lm_twitter.bin simplified_titles.txt > probs_per_title_from_twitter_lm.txt
$ paste probs_per_title_from_twitter_lm.txt titles.txt > scored_titles.txt
Here are the most surprising (lowest probability) titles:
$ sort -n scored_titles.txt | head -5
-101.143585 Legend
-101.143585 Mastervation
-101.143585 Mecosystem 2020
-101.143585 Netwalking
-101.143585 STEM++
The least surprising ones:
$ sort -nr scored_titles.txt | head -5
-1.8924396 You Don't Know App
-1.926368 I'm with the Brand
-1.9575752 A 360 of the next 365
-2.0284777 It's Not Where You Eat, It's What You Eat
-2.1068375 It's The End of The Internet As We Know It...
And several close to the median probability:
$ sort -n scored_titles.txt | tail -1400 | head -5
-31.45754 College Fan Data: Where Passion Meets Wallets
-31.415817 "You Can't Make This Up" - Animated Shorts in EDU
-31.316021 Adaptive Learning and MOOCs: Moving Beyond Videos
-31.29952 USA/Europe co-production and funding opportunities
-31.292368 Common Skills Language: Key to Sourcing Talent
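You can also grep scored_titles.txt to see how any particular title scored; the search string here is just an example.
$ grep -i "internet of things" scored_titles.txt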
Reminder: this is based on a language model trained on just 1000 tweets; you really need more text to get a better model. But once you get more data, you can follow the same recipe as above.
As an exercise, you can train a language model on the SXSW proposal titles themselves and then apply it back to the titles. (To be very clean, you would need to do this with a leave-one-out setup, but feel free to be sloppy and just train a single model and apply it.) The reason this would be interesting: given all the titles for this year, it shows how unique each one is relative to the others in its cohort.
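Here is a minimal sketch of the sloppy single-model version, reusing the commands from above; the file names lm_titles.arpa, lm_titles.bin, probs_per_title_from_title_lm.txt, and self_scored_titles.txt are just names I made up for this variant.
$ ../../maul.sh run edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 3 lm_titles.arpa simplified_titles.txt
$ ../../maul.sh run edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm_titles.arpa lm_titles.bin
$ ../../maul.sh run maul.lm.OutputAverageLogProbabilityPerLine lm_titles.bin simplified_titles.txt > probs_per_title_from_title_lm.txt
$ paste probs_per_title_from_title_lm.txt titles.txt > self_scored_titles.txt
$ sort -n self_scored_titles.txt | head -5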
That’s it! This is a taste of some of the basic hacking one can do to play around with text data: the sort you can do on a plane ride (see Peter Norvig’s How to Write a Spelling Corrector for a classic post written on a plane). Language models and topic models are great building blocks for more complex models, and if you can hack some Java and perhaps some Scala, you now have two great implementations (with friendly Apache licenses) to work with as a starting point. Start digging into the probabilistic foundations of these models, work through assignments on natural language processing and its applications, check out online courses like Dan Jurafsky and Chris Manning’s NLP Coursera course (or even go to a graduate program), and you’ll be on your way to data sciencing for yourself!