You can be anyone and anything you want to be on the Internet, right? Sure, but you'll have a pretty tough time concealing your gender from a clever program which could identify if you're male or female based on as little as a single tweet.
FastCompany reports that researchers from the Mitre Corporation will soon present a paper at the Conference on Empirical Methods in Natural Language Processing in Scotland, in which they describe a method of identifying the genders of Twitter users.
Before we get into the details of the method and paper, though, let's clarify something. The terms "sex" and "gender" have generally accepted meanings, as outlined by the World Health Organization:
"Sex" refers to the biological and physiological characteristics that define men and women.
"Gender" refers to the socially constructed roles, behaviours, activities, and attributes that a given society considers appropriate for men and women.
To put it another way:
"Male" and "female" are sex categories, while "masculine" and "feminine" are gender categories.
In the case of the paper we're discussing, which is entitled "Discriminating Gender on Twitter," the terms appear to have been used interchangeably. The researchers created a dataset based on Twitter users whom they could classify as either male or female (meaning they knew the individuals' sexes) while appearing to focus on identifying gender (as in masculine and feminine traits). This leaves us a little bit confused and wondering, among other things, how transgender individuals would fit into the dataset.
But that oddity and discrepancy aside, here's what the researchers figured out after creating their dataset:
The dataset was about 55% female, 45% male (which squares roughly with estimates of Twitter's overall gender breakdown). Thus, by guessing "female" for every user, a computer would be right 55% of the time.
But then they actually had a computer analyze the data and make guesses. The results were a bit surprising:
Simply by examining the full name of the user, a computer was accurate about 89% of the time--a remarkable improvement, if not an especially interesting one, since first names are highly predictive of gender. The Mitre findings become intriguing, though, when the team limited its analysis to tweets alone. By scanning for patterns in all the tweets of a given user, Mitre's program was able to guess the correct gender 75.8% of the time--a 20% improvement over the baseline. And even just by analyzing a single tweet of a user, it was right 65.9% of the time--an over 10% improvement over the baseline.
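To make the comparison with the baseline concrete, here is a minimal sketch of the arithmetic, using only the accuracy figures quoted above. The function name `gain_over_baseline` is made up for illustration. The sketch suggests that the "20%" and "over 10%" improvements read most naturally as percentage-point gains over the 55% baseline, not relative gains.

```python
# Accuracy figures as quoted in the article, in percent.
BASELINE = 55.0  # always guessing "female"
REPORTED = {
    "full name": 89.0,
    "all tweets": 75.8,
    "single tweet": 65.9,
}


def gain_over_baseline(accuracy, baseline=BASELINE):
    """Return (percentage-point gain, relative gain in percent)."""
    points = accuracy - baseline
    relative = points / baseline * 100
    return round(points, 1), round(relative, 1)


for name, accuracy in REPORTED.items():
    points, relative = gain_over_baseline(accuracy)
    print(f"{name}: +{points} points ({relative}% relative to baseline)")
```

By this reckoning, scanning all of a user's tweets gains about 20.8 points and a single tweet gains about 10.9 points, which lines up with the figures in the quote.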
It gets even better, though! By letting the computer analyze more than one data field before making a guess, the researchers were able to boost its accuracy to a whopping 92 percent.
Crazy, right? The program can be so accurate because our word choices and character usage give us away. For example, females are more likely to use exclamation marks and smiley faces in tweets, while males are more prone to include the word "google" in their Twitter posts.
Well, perhaps that particular tidbit is a bit amusing and demonstrates one of the difficulties of gathering data from a social media service, but no matter: The research results remain the same, as does the accuracy demonstrated.
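For the curious, here is a rough sketch of what counting those kinds of cues in a single tweet might look like. This is not the researchers' program: the function name `surface_cues`, the smiley pattern, and the choice of cues are all assumptions, picked only from the examples mentioned above. A real classifier would weigh many more features.

```python
import re

# Hypothetical smiley pattern: matches :) ;) :-) ;-) :D and similar.
SMILEY = re.compile(r"[:;]-?[)D]")


def surface_cues(tweet):
    """Count a few surface cues in one tweet (illustrative only)."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return {
        "exclamation_marks": tweet.count("!"),
        "smileys": len(SMILEY.findall(tweet)),
        "mentions_google": words.count("google"),
    }


print(surface_cues("So excited for the weekend!!! :)"))
print(surface_cues("Google just changed its search page"))
```

Counts like these would only be the input to a guess. Deciding how much each cue should count is the part the program has to learn from a labeled dataset.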
So you might as well give up on attempting to act as if you're a hot babe if you happen to be an older gentleman — or vice versa — before your tweets give you away.
Rosa Golijan writes about tech here and there. She's obsessed with Twitter and loves to be liked on Facebook. Oh, and she can be found on Google+, too.