I’ve been meaning to find a chance to look into event titles and see whether some straightforward Unix commands, text analytics, and natural language processing can reveal anything interesting. For this experiment, I focused on proposal titles for SXSW 2015 panels.
People reportedly put a lot of thought into their titles, since a strong title is a big part of getting a proposal noticed during the community portion of the panel voting process. SXSW gives proposal creators lots of feedback, including guidance on titles:
“Vague, non-descriptive language is a common mistake on titles — but if readers can’t comprehend the basic focus of your proposal without also reading the description, then you probably need to re-think your approach. If you can make the title witty and attention-getting, then wonderful. But please don’t let wit sidetrack you from the more significant goals of simple, accurate and succinct.”
In short, a title should stand out while remaining informative. It turns out there has been research in computational linguistics on how to craft memorable quotes that bears directly on standing out. Danescu-Niculescu-Mizil, Cheng, Kleinberg, and Lee’s (2012) “You had me at hello: How phrasing affects memorability” found that memorable movie quotes use less common words built on a scaffold of common syntactic patterns (incidentally, the paper itself has great section titles). Chan, Lee, and Pang (2014) go a step further and build a model that predicts which of two versions of a tweet will get a better response, measured in retweets (see the demo).
It would be fun and interesting to go to those lengths with SXSW titles, but in the interest of time, I’m going to take a very simple approach here to give a flavor of the kinds of things that can be done. Let’s start with the data: Emily Tagtow kindly scraped 2847 titles from the SXSW PanelPicker voting site. There may be a significant portion that were missed in the process, but it’s a fair number to work with.
So, what can we do with titles? First, we can use them to compute a topic model to see what the main themes are. Here’s the output from using Mallet’s LDA implementation.
- brand, mobile, content, experience, UX, creating, design, art, marketing, lessons
- big, data, internet, things, marketing, small, startup, global, designing, iot, lies
- data, open, privacy, innovation, ed, change, higher, future, don, social, healthcare
- social, media, digital, culture, journalism, good, public, breaking, stories, modern
- content, digital, distribution, start, people, free, today, make, making, long
- smart, home, guide, future, drones, business, ceo, collaborative, superheroes, gov
- building, tech, digital, st, century, education, college, gap, engagement, ed, video
- learning, education, design, students, classroom, innovation, tech, online, science
- women, style, film, meet, community, government, customers, place, takes, indie
- music, future, tech, digital, world, brands, make, making, time, work, social, changing
Sure enough, we have at least one topic for each of the main areas of SXSW (interactive, edu, music and film). The proportions are about right too: 68% of the proposals were submitted for interactive and 6 of the 10 topics align with interactive themes, 25% were for edu and 2 topics align with that, and one topic each remains for music and film.
The topics give a sense of the titles at a glance, but we want to try to create an algorithm that teases better titles apart from worse ones. For this, we start by asking which words are most frequent. After very crude tokenization, lower-casing all words, and filtering out stop words (e.g. “the”, “of”, etc.), the top ten words are the following.
128 social
125 data
113 digital
111 tech
105 future
99 learning
87 design
84 music
71 education
69 big
Yep, that looks like SXSW, alright. (Don’t worry: “media”, the all-important companion of “social”, is number 11, with 68 occurrences.)
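Counts like these take only a few lines to compute. Here is a minimal Python sketch of the approach; the tiny stop-word list and example titles are stand-ins for illustration, not the actual ones used in the experiment.

```python
import re
from collections import Counter

# A toy stop-word list; the real analysis used a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "for", "is", "on", "at"}

def tokenize(title):
    # Very crude tokenization: lowercase, keep alphabetic runs (and apostrophes).
    return re.findall(r"[a-z']+", title.lower())

def top_words(titles, k=10):
    counts = Counter(
        word
        for title in titles
        for word in tokenize(title)
        if word not in STOP_WORDS
    )
    return counts.most_common(k)

titles = [
    "The Future of Social Media",
    "Big Data and the Future of Tech",
    "Designing for Social Good",
]
print(top_words(titles, 3))
```

The real run differs only in the input file and the stop-word list; the Counter does the rest.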
How about pairs of words, which we term “bigrams” in computational linguistics? This time leaving in the stopwords, the top ten are the following.
108 how to
79 the future
65 future of
60 in the
59 and the
47 of the
40 social media
36 the new
36 big data
28 the world
So, clearly there are a bunch of folks wanting to tell us “how to X” and talk about “the future” and even specific “future of” thises and thats. Interestingly, a favorite buzzword of the moment, “big data”, is pulling up on “social media”. And 28 panels have something to say about “the world”. Let’s look to trigrams (triples of words) to see what they are talking about. Here are trigrams that contain “the world” and are used in at least two panel titles.
5 save the world
4 the world s
3 a digital world
2 world s first
2 the world of
2 the world cup
2 over the world
2 change the world
Who are these five that want to talk about saving the world?
- Project Based Learning-Helping Kids Save the World
- Can Digital Save the World?
- Save the World: Pants Optional
- How Cool Gadgets Can Save The World
- Toilets and Trash-Will 3D Printers Save the World?
A mix of idealistic, hyperbolic, and humorous — all probably figuring that the use of “save the world” would be eye-catching (and possibly right in doing so). Lastly, let’s turn to quadrigrams. Here are all that occur in three or more proposals.
6 the internet of things
6 and the future of
5 the future of work
5 in the digital age
4 the future of the
4 in the age of
4 how to create a
4 how to build a
3 the future of retail
3 the future of music
3 the future of brand
3 the age of the
3 in the st century
3 in a digital world
3 how to make a
3 how to be a
3 for the internet of
3 change in higher ed
It’s no particular surprise to see a multiword expression like “the internet of things” showing up six times, but it is perhaps a bit more surprising that three proposals include “the future of retail”, or music, or brand (not surprising for “work”—people have been obsessed with that since, well, forever). Looking at the actual titles, one grandly focuses only on “the future of retail” while the others give a bit more detail.
- The Future of Retail
- Death to the Register: The Future of Retail
- The Future of Retail Analytics
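All of the bigram, trigram, and quadrigram counts above fall out of the same sliding-window trick over the token stream. A minimal Python sketch, assuming the same crude tokenization as before (the example titles here are made up):

```python
import re
from collections import Counter

def tokenize(title):
    # Crude tokenization: lowercase, keep alphabetic runs (and apostrophes).
    return re.findall(r"[a-z']+", title.lower())

def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(titles, n, k=10):
    counts = Counter()
    for title in titles:
        counts.update(ngrams(tokenize(title), n))
    return counts.most_common(k)

titles = [
    "How to Win at Social Media",
    "The Future of Social Media",
    "How to Survive the Future of Work",
]
print(top_ngrams(titles, n=2, k=3))  # top bigrams
print(top_ngrams(titles, n=3, k=2))  # top trigrams
```

Note that stop words stay in for n-grams, since sequences like “how to” and “future of” are exactly what we want to surface.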
Anyway, it seems that creators of SXSW titles feel they should be providing how-to’s or talking about the future of our current amazing age, which comes in many flavors.
digital age
media age
social media age
golden age
exponential age
entrepreneurial age
age of proximity
age of gifs
age of engagement
age of anyshoring
age of the woman
age of the tech
age of the selfie
So, of all the titles, who came up with novel, interesting ones? As a simple attempt, let’s identify titles that use word sequences that are unexpected. There are many ways to do this, but the bread-and-butter method in computational linguistics is to use an n-gram language model. Such models are used all over the place in natural language processing applications, including speech recognition, spelling correction, optical character recognition, machine translation and more. Basically, they predict the probability of the next word given the previous n-1 words, and this allows them to predict the probability of an entire sentence.
An n-gram language model is trained on some example text, often millions or billions of words long. When we ask the model to predict the probability of a new text, it assigns higher probability to it if it looks like that training data, and lower if it doesn’t. In machine translation, this helps with determining whether the Portuguese sentence “estou com tanta fome” is better translated as “I’m with such hunger” or “I’m so hungry”. The former is more or less a word-for-word translation (and thus gets high probability from a translation model), but the latter looks more like how English is generally spoken (and thus gets higher probability by a language model trained on lots of English text).
In the case of the SXSW titles, we can be reasonably sure that they are all in English, so what a language model can do is tell us when one is more surprising (low probability) or very much like what it has previously seen (high probability). Of course, the training data matters very much here. To get relevant but non-overlapping text, I grabbed about two million tweets produced by people who follow the @sxsw account and created a trigram language model using BerkeleyLM. Using the trained model, we can rank the titles in terms of whether the model thinks they look familiar or unfamiliar. A key thing to realize here is that the model hasn’t seen the titles, nor has it seen text from the time after the titles were submitted (since after that it is possible we’d see the titles in tweets promoting or discussing them).
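To make the mechanics concrete, here is a toy trigram model in Python with stupid-backoff smoothing. This is a simplified stand-in, not what BerkeleyLM actually does (BerkeleyLM uses proper Kneser–Ney smoothing, and the real model was trained on the full two million tweets); the three training sentences below are made up.

```python
import math
from collections import Counter

class TrigramLM:
    """Toy trigram model with stupid backoff (alpha = 0.4).
    A stand-in for a real toolkit like BerkeleyLM."""

    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        self.counts = Counter()  # n-gram counts for n = 1, 2, 3
        self.total = 0           # total token count
        for sent in sentences:
            tokens = ["<s>", "<s>"] + sent.lower().split() + ["</s>"]
            self.total += len(tokens)
            for n in (1, 2, 3):
                for i in range(len(tokens) - n + 1):
                    self.counts[tuple(tokens[i:i + n])] += 1

    def prob(self, w, context):
        # Back off from trigram to bigram to unigram, discounting by alpha.
        tri, bi = context + (w,), context[1:] + (w,)
        if self.counts[context] and self.counts[tri]:
            return self.counts[tri] / self.counts[context]
        if self.counts[context[1:]] and self.counts[bi]:
            return self.alpha * self.counts[bi] / self.counts[context[1:]]
        # Unigram with add-one so unseen words get nonzero probability.
        return self.alpha ** 2 * (self.counts[(w,)] + 1) / (self.total + 1)

    def logprob(self, sentence):
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log(self.prob(tokens[i], tuple(tokens[i - 2:i])))
            for i in range(2, len(tokens))
        )

lm = TrigramLM(["the future of work", "the future of music", "save the world"])
# A sentence resembling the training data scores higher than a scrambled one.
print(lm.logprob("the future of work") > lm.logprob("work of future the"))
```

The same contrast drives the title ranking: familiar phrasings get high log-probability, odd ones get low.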
Before ranking titles with a language model, here are the counts in roughly 2 million tweets of some of the n-grams discussed above.
6286 social media
797 big data
364 change the world
131 the internet of things
31 the future of work
77 in the digital age
70 save the world
61 how to create a
3 the new music economy
For comparison, some top bigrams, trigrams and quadrigrams in the corpus are “in the” with 47,231 occurrences, “thanks for the” with 6305, and “just posted a photo” with 4309.
Now, let’s score the titles for novelty and look at the top and bottom of the list. The score I use is the log-probability of a title divided by its length in words. This helps, imperfectly, to allow comparison between titles with different numbers of words. One way to think about it: how surprised is the model, on average, as it sees each word of the title?
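As a sketch of the scoring step, here are made-up per-word probabilities standing in for real language-model output (the actual experiment used trigram probabilities conditioned on context, not independent word probabilities):

```python
import math

# Hypothetical per-word probabilities, a stand-in for language-model output.
WORD_PROBS = {"the": 0.05, "future": 0.01, "of": 0.04, "work": 0.005}
UNSEEN_PROB = 1e-6  # floor for out-of-vocabulary words

def surprise_score(title):
    """Average log-probability per word; higher means less surprising."""
    words = title.lower().split()
    logp = sum(math.log(WORD_PROBS.get(w, UNSEEN_PROB)) for w in words)
    return logp / len(words)

titles = ["the future of work", "mecosystem netwalking"]
# Sort least surprising first; unseen coinages sink to the bottom.
for title in sorted(titles, key=surprise_score, reverse=True):
    print(round(surprise_score(title), 2), title)
```

Dividing by length is what keeps a long title from looking more surprising than a short one merely because it has more words to multiply probabilities over.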
I’m going to show some actual titles of SXSW proposals here and comment on them, including whether I like them or not. In doing so, I am not passing judgment on the quality of the proposals themselves—really I’m just gut-checking my own response to the titles. With that in mind, here are the ten most surprising titles according to the model.
- Mastervation
- Mecosystem 2020
- Netwalking
- Micro Visulaizations
- IBM Watson: Lnkiing lngagaue and ihsgnits
- Tech + Art: Democratizing Digital Entertainment
- Effective Partnerships: Charters and ISDs
- SchoolAsAService: Supporting Student Success@Scale
- VodkaforDogPeople: Matchmaking a business heart
- Smart Minds on Smart Homes
The first three are all single-word items that didn’t occur in the training material (note that the 2020 was stripped from the input). These are maximally surprising to the model, but they aren’t very interesting to me on sight. The next two have spelling errors, so they also have a large proportion of words that are unknown to the model. In the case of the IBM Watson title, the misspellings are intentional; for me, as a natural language processing researcher, that is an interesting move, but I wonder whether it would be as interesting for a general audience—I’m guessing most people wouldn’t get it. SchoolAsAService and VodkaforDogPeople create surprising words by running several words together, and their subtitles use relatively common words in uncommon combinations. In all, I don’t think any of these are especially attention-grabbing titles. How about if we look at the titles that are the least surprising?
- What is Wrong with Me?
- How To Survive A Zombie Apocalypse
- Don’t like change? It’s time to change your mind.
- If Content is King, then Context is God
- It’s The End of The Internet As We Know It…
- How tech is changing the way we listen to music
- How to Find Your Happy Place
- How to Measure Content Marketing Success
- I’ll Show You Mine if You Show Me Yours
- At Some Point You Are Going To Need A Team

None of these strike me as both enticing and descriptive. The title “It’s the End of the Internet as We Know It” does follow the recipe that Danescu-Niculescu-Mizil et al. (2012) found to be effective for memorability: take a reasonably common expression and switch it up a bit with an unlikely word or two. However, in this case the expression is already pretty worn out, the new word “internet” is quite likely, we don’t get much of an idea of what is bringing about the end, and we don’t even know whether the proposer feels fine about it.
A side note: high probability could mean that a title is evoking a concept currently of general interest (e.g. “the internet of things”). That surely helps make a proposal interesting to a lay audience; however, it seems reasonable that choosing a quadrigram five other proposals have also used will put a proposal into direct competition with those others. That is, a judge might decide to pick just one of the “the internet of things” or “the future of work” proposals. Future proposers can guard against this possibility by checking n-gram counts in tweets in the months before submission and avoiding those with high counts.
Interim conclusion: neither the most surprising nor the least surprising titles according to a trigram language model are likely to be the best of the batch. So this attempt at separating the compelling from the not-so-compelling is not a success. This sort of situation often arises when attempting to solve problems in natural language processing, and the next step is to refine things a bit. Perhaps we need titles that are in the middle of the surprisingness range, fresh without being unfamiliar. We can get at that by taking the titles whose scores fall near the median.
- Grades are STILL stupid! We can assess better.
- Consumers Power the Media. Now What, Brands?
- How Disney Animation Studios Reinvented Itself
- The Future of Feel Good: Innovative Music Orgs
- Prototyping for the (emerging tech) win
- Beyond a Flipped Classroom: Students as Teachers
- Going Global: The Secrets to International Success
- Navigate the college market at a student run venue
- Reconciling Reality and Rhetoric of Student Loans
- Pushing and Pulling: The New Listening Experience
- Disrupting Innovation: Book Publishing and New Media
- Wear to Learn: The Body as Interface
- Future of Health Tech:Creating Predictive Devices
- Mind the Gap: Faculty and Student Tech Experiences
- Ready to wear? Body informed 3D printed fashion
- The Biggest Issues in Digital Ethics
- Making music physical again
- Measures that Matter: Capturing Student Feedback
- Student Data: When Innovation Meets Privacy Law
- Sex, Wine and Donuts: Adventures in Weight Loss Data
I pulled this list without looking and did not revise it or how I extracted it. There are several good titles in this batch, especially the first, which is both interesting and descriptive. Several have a catchy headline with a more informative byline, and none have (unintended) spelling errors (which, I’ve noticed, are oddly a problem for quite a few titles in the data). This suggests that being in the middle of the pack in terms of surprisingness is a good feature for a title. By combining it with other ideas explored in the two papers mentioned above, and by further getting data on which past proposals were accepted and rejected, I’m sure a pretty good model for automatically rating titles (or at least saying which of two given titles is better) could be built.

How about the proposal “Are you in a social media experiment?” that I’m part of with Philip Resnik, Jennifer Golbeck, and Michelle Zhou? It’s the 65th least surprising. As a sentence, it has reasonably high probability, but the phrase “social media experiment” is not only fairly distinctive, it is also current and relevant because of the recent Facebook contagion experiment kerfuffle. It is also active, since it is about “you”, the reader/listener, and it asks a question. I personally think it is a great title—kudos to Philip Resnik for coming up with it! And if you’ve made it this far before the end of Friday, September 5, please check it out and vote for it if you’d like to see it at SXSW 2015. (Thanks!)
For anyone interested in doing some of this themselves, I have written a follow-up post, “Data Sciencing by Numbers: A walk-through for basic text analysis”, that walks through how I obtained the counts, probabilities, and topics using Unix commands, BerkeleyLM, and Mallet.