Detecting Twitter users’ gender, en français

With 230 million users, Twitter has become a global force in social media. And not just in English.

Published: 27 November 2013

Data miners have been hard at work trying to figure out the attributes of Twitter users, such as gender and age, that aren’t explicitly revealed on Twitter feeds. That information could be hugely valuable to marketers, enabling them to target messages to their desired audience. Nearly all the research done so far, however, has focused on English-language users and content.

Now, a McGill research team has conducted one of the first studies designed to figure out the gender of Twitter users who primarily use languages other than English.

Among the key findings: by using a special detector based on French-language syntax, the researchers showed that it is very easy to classify gender for Twitter users in French, and probably for other Romance languages. In particular, the researchers developed an algorithm to look for masculine or feminine adjectives or past participles following the phrase “Je suis” (or variants such as “je ne suis pas”).

Based on this construction, the detector was able to determine the gender of users with 90% accuracy, significantly higher than the accuracy rates of 80% to 85% achieved by various algorithms that have been developed to analyze English-language content.

Because French adjectives and past participles have masculine and feminine forms that are often spelled differently, “You don’t have to get too fancy” to develop an effective gender detector for Tweets in the language, says Derek Ruths, a McGill computer-science professor who co-authored the study.
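
To make the idea concrete, here is a minimal sketch in Python of a detector built on the “Je suis” construction. It is an illustration only, not the researchers’ implementation: the tiny word lists, the list of intensifiers, and the function names `gender_votes` and `guess_gender` are all assumptions made for this example. A real detector would need a full French lexicon and would have to handle informal spellings.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical mini-lexicon of gendered forms (assumption: a real system
# would draw on a complete list of French adjectives and past participles).
MASCULINE = {"content", "heureux", "fatigué", "allé", "prêt", "désolé", "arrivé"}
FEMININE = {"contente", "heureuse", "fatiguée", "allée", "prête", "désolée", "arrivée"}

# Matches "je suis X", "je ne suis pas X" and "je suis pas X", optionally
# skipping one intensifier such as "très" before the word of interest.
PATTERN = re.compile(
    r"\bje\s+(?:ne\s+)?suis\s+(?:pas\s+)?"
    r"(?:(?:très|trop|si|vraiment|tellement)\s+)?(\w+)"
)


def gender_votes(tweets):
    """Count masculine and feminine forms that follow 'je suis' in the tweets."""
    votes = Counter()
    for tweet in tweets:
        # Normalise accents to composed form so "é" compares equal to the lexicon.
        text = unicodedata.normalize("NFC", tweet).lower()
        for word in PATTERN.findall(text):
            if word in FEMININE:
                votes["female"] += 1
            elif word in MASCULINE:
                votes["male"] += 1
    return votes


def guess_gender(tweets):
    """Return 'female', 'male', or None when the evidence is absent or tied."""
    votes = gender_votes(tweets)
    if votes["female"] == votes["male"]:
        return None
    return "female" if votes["female"] > votes["male"] else "male"


if __name__ == "__main__":
    sample = ["Je suis très contente aujourd'hui", "je ne suis pas fatiguée"]
    print(guess_gender(sample))
```

The sketch takes a majority vote over all of a user’s tweets and declines to guess when it finds no gendered form or a tie, which is one simple way to avoid labelling users on no evidence.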

Since most individuals include photos of themselves on their Tweets, identifying male and female users might seem as simple as looking at the photos. But sorting through hundreds of millions of tweets is a task for computers, and “computers aren’t good at looking at pictures,” Ruths notes.

The McGill study was presented at a recent international conference in Seattle organized by the Association for Computational Linguistics. The paper also examines Twitter data sets for Japanese, Indonesian and Turkish. Japanese proved to be the toughest for inferring gender.

The results obtained for French show that some languages have features better suited for certain classification tasks. “Identifying and leveraging such features promises to be an interesting and effective direction for future work,” adds McGill linguistics professor Morgan Sonderegger, who co-authored the paper with Ruths and computer-science undergraduate student Morgane Ciot.

Link to the paper:

Link to the conference website: