Understanding Multilingual Communities through Analysis of Code-switching Behaviors in Social Media Discussions

A Track in the IEEE Big Data 2019 Big Data Cup


Social media facilitates large numbers of discussions and debates that involve the participation of different communities from across the globe. Anybody can contribute to a discussion by responding to a piece of content of any depth that they encounter. Multilingual individuals often alternate between languages within a single conversation (code-switching) to communicate effectively. Characterizing these discussions helps to analyze the contextual factors affecting the cultural diversity of a community. The variety and diversity of communities engender multilingual discussion, which provides valuable insight into the evolution of trends and public opinion in communities.

Currently, the enormous span of social media usage, while providing valuable resources for linguistic behavior analysis, makes tracking and understanding these multilingual discussions a challenging task. Significant progress has recently been made by developing community detection algorithms based on the network structure of followers and friends, the interactions of retweets and mentions, and the hashtags occurring in the discussion context. However, the general applicability of these state-of-the-art approaches has been limited, because the resulting communities depend heavily on the type of attributes used in the detection, and the linguistic characteristics of the discussions are ignored in the analysis.

To code-switch is to change from one natural language to another within a conversational turn at talk. There are diverse inventories of the functions of this behavior, which overlap to varying degrees; there are related but contrasting notions, e.g. “code-mixing” as interspersing of individual second-language words into one's stream of talk; and there are measures of its levels of complexity.
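As a rough illustration of what a complexity measure might look like, the sketch below computes two simple quantities from per-token language tags: the fraction of adjacent token pairs where the language changes, and the share of tokens outside the dominant language. Both function names, the tag format, and the example tags are assumptions made for illustration; they are not the measures or the scoring used in this competition.

```python
from collections import Counter
from typing import Sequence


def switch_point_fraction(tags: Sequence[str]) -> float:
    """Fraction of adjacent token pairs whose language tags differ."""
    if len(tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / (len(tags) - 1)


def minority_language_share(tags: Sequence[str]) -> float:
    """Share of tokens not in the most frequent language (0.0 = monolingual)."""
    if not tags:
        return 0.0
    dominant_count = Counter(tags).most_common(1)[0][1]
    return 1.0 - dominant_count / len(tags)


# Hypothetical per-token language tags for one post
# ("en" = English, "es" = Spanish); not taken from competition data.
example_tags = ["en", "en", "es", "es", "en", "en"]
print(switch_point_fraction(example_tags))
print(minority_language_share(example_tags))
```

A monolingual post scores 0.0 on both quantities, while a post that alternates languages at every token approaches 1.0 on the first. Real measures would also need to handle language-neutral tokens such as names, hashtags, and emoji, which this sketch ignores.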

Studying how the linguistic and interactional patterns that occur in a discussion are associated with its social meaning or function requires large datasets of naturally occurring examples of code-switching. Such datasets are particularly important for building models, and ultimately technologies, that automatically recognize that meaning or function. In the age of social media, global communication, and big data, assembling corpora of this type of material should be within reach.


We propose a competition for labelling social media posts within a given set of discussions, such as from Twitter, where each discussion contains posts in multiple languages. The task is both to correctly identify the languages involved and to label each post with a "relevance" score indicating how relevant the post is to the discussion itself. Solutions will need to identify multiple languages and to recognize the context of individual posts within a discussion.


The top 3 scoring submissions will receive cash prizes, provided the winning teams attend the Conference. Prizes are offered on condition that the winning solutions are open sourced.
  • First-place team: US$2000
  • Second-place team: US$1000
  • Third-place team: US$500