Trends in Online Dating!
See the OkCupid NLP repo for the code and slides for this project!
The Question
My first exposure to data science was through OkTrends, a blog by OkCupid. OkCupid is a dating website that focuses on compatibility between users via answers to multiple choice questions. I loved the concept of using data to analyze human behavior, particularly in the high stakes world of dating! As a result, when it came time to choose a project to utilize NLP (Natural Language Processing) techniques and I stumbled upon an OkCupid dataset with lots of text features, I knew exactly what to do.
Being single and having an upcoming birthday, I wanted to see how dating changed with age; my guess was that users had more concrete, serious relationship goals as they got older, but I was curious to see how self-presentation changed. In addition, since I had spare time and the framework could easily be applied to other demographic information, I explored how dating profiles changed with sex and how profiles changed with pet preference.
The Data
The dataset consisted of 60,000 OkCupid profiles from users in the San Francisco Bay Area, scraped in June 2012. The dataset had basic demographic information like age, gender, education level, but also some quirkier features, like pet preferences and zodiac sign. Most importantly, the dataset had the text of the users’ profiles! OkCupid profiles are broken out into sections, where users respond to different prompts. For an example of what this looks like, see my “OkCupid profile” below.
I chose to focus on a handful of sections that I thought would be most insightful: Self-Summary, I’m Really Good At, What People First Notice About Me, I Spend A Lot Of Time Thinking About, and Message Me If. Each section has its own purpose; for example, I’m Really Good At shows what users value the most about themselves, What People First Notice About Me, in my opinion, is a reflection of what users first notice about other people, and Message Me If is a concise section to show what you’re looking for from the website and a relationship!
The data itself wasn’t too bad to deal with. I had to use a lot of regex tools to get rid of HTML tags, links, punctuation, and some strange formatting, but it was pretty straightforward! I also used lots of standard NLP techniques like lemmatization and part-of-speech tagging. I did have to perform some undersampling, as there was a large gender imbalance in the dataset that changed with age; it was 60/40 male/female overall, but the gap greatly decreased with age. Since I was modelling on age, I didn’t want age to just be a proxy for gender, so undersampling really helped!
Modeling
Now, it’s time for the most satisfying part of the data science process- modeling!
For this project, I performed topic modeling via NMF (Non-negative Matrix Factorization) to understand the themes in the profiles and to see how the presence of these themes changed with age.
I also used a metric derived from Scattertext called the scaled F-score. This score is a measurement of how unique a term is to a specific category and how commonly it is used within that category. For example, for profiles for users aged 18-25, ‘Mario Kart’ is seen more much often compared to profiles of other age groups. In addition, ‘Mario Kart’ is used very frequently out of the all terms used in profiles for users aged 18-25. As a result, ‘Mario Kart’ will have a high F-score for users aged 18-25!
The F-score will range from -1 to 1. An F-score closer to -1 or 1 indicates that the term is more unique to a particular group and is used often, while a score closer to 0 indicates that the term isn’t specific to any group and/or isn’t used frequently. If you’re curious about the math behind the F-score, feel free to reference this!
Results for Age
Let’s start with looking at how dating changes with age. For age, I categorized users into two age groups: 18 to 29 and 30+. In our data set, 30 was the median age. Terms with a score closer to -1 are more likely to be associated with the 18 to 29 age group while terms with a score closer to 1 are more likely to be associated with the 30+ age group. Also, I’m aware that the graphs are a bit garish, but I was excited about Pride month!!
- Age- Message Me If
-
Older users seem to be looking for more serious relationships
-
Age- I’m Really Good At Trending
-
There a transition from fun skills to more practical and nurture-driven skills with age
-
Age- I’m Really Good At
-
Once again, we see a transition from fun skills and hobbies to social and practical skills
-
Age- What People First Notice About Me
- As users get older, they seem to notice personality attributes more often, as opposed to physical attributes
In all honesty, this exercise made me a bit less scared of dating as I get older! The differences I found between age groups also explained, in part, how the dating app market has formed, with apps like Tinder appealing to a younger audience and other services like Match.com appealing to an older audience. Each app serves different needs and priorities, with younger audiences being more physically driven and adventurous and older audiences looking for more emotional intelligence and substantial relationships.
Results for Pets
Now, on a lighter note, let’s shift our focus to terms most associated with dog and cat people! Someone like myself, who likes both cats and dogs, would not be included in this data; only users who showed positive feelings towards either dogs or cats were included.
- Cat people show more interest in science and culture, dog people seem more athletic, traditional, and social
Results for Sex
I found these results to be a bit stereotypical, but I still think they’re worth sharing! Note, in 2012, when the data was collected, OkCupid only had two options for sex: male and female. Today, OkCupid has many more options.
- Sex- What People First Notice About Me
-
Females tend to write about both physical attributes and personality; males are much more likely to write about physical features (including eyelashes?)
-
Sex- I’m Really Good At
-
Parallel parking and hula hoop?
-
Sex- I Spend A Lot Of Time Thinking About
- Males seem to be more grandiose- see ‘jack (of all) trades’ and ‘fix anything’ in prior set of graphs and ‘life universe’ here
Farewell!
I hope you enjoyed my analysis and, as always, feel free to reach out if you have any thoughts or questions!