Understanding Multilingual Communities through Analysis of Code-switching Behaviors in Social Media Discussions
A Track in the IEEE Big Data 2019 Big Data Cup
Data
The first ground truth dataset, Dataset1.json, will be mailed to you when you register. It consists of 137 discussions with a total of 302 posts. Languages include mainly English, Spanish, French and Italian, with other languages appearing sparsely. More datasets may become available as the competition progresses.
The Dataset1.json file consists of one JSON object per line. Each JSON object describes a discussion and provides ground truth
for some of the posts in the discussion. An example of the format is here. Generally the structure is:
{
  "discussion": {
    "discussion": [
      {
        "username": "@user2",
        "timestamp": null,
        "in_reply_to_username": null,
        "in_reply_to_status_id_str": null,
        "id_str": "1108371563617886208",
        "indentation": 1,
        "text": "[TEXT UNAVAILABLE]"
      },...
    ]
  },
  "ground_truth": [
    {
      "consensus": {
        "tokens": [
          {
            "lang": "und",
            "word": "@user1"
          },...
        ],
        "postOpposes": false,
        "postSupports": false,
        "relevance": 4,
        "postProvidesAdditionalInfo": true,
        "postQuestions": false,
        "postFullSentence": true
      },
      "post_id": 3
    },...
  ]
}
The inner discussion array consists of a number of text posts with the fields shown in the example. The posts are ordered in "forum order", and each post, apart from the first, replies to another post. In some cases very little is known about the first post; in these cases many of its fields are blank/default, with "[TEXT UNAVAILABLE]" for the text. The ground_truth array consists of a number of ground truth objects, each containing the post_id of the post in the discussion, which is a number from 1 to the number of posts, and the ground truth data for that post in the consensus field. Not all posts in the discussion have ground truth supplied. Further details of the ground truth are explained below.
Tasks
Your task is to correctly predict the ground truth for a set of discussion posts. The first test data set is called TestData1.json
and will be mailed to you when you register. It has the same format as the Dataset1.json file, except that the ground truth has been
removed. Not all posts in the discussions are required to be labelled. You need to create a consensus field for each post that matches the actual consensus field in the ground_truth that we keep secret.
The scoring script in Python is available here. The script provides a score for a set of discussions, e.g. in Unix:
$ cat Dataset1.json | score.py
scored 137 discussions with a total of 302 scored posts: 1.000000
In this case, the scoring script was used to score the actual ground truth against itself (if a predicted_ground_truth field is not found, the script uses the ground_truth field against itself), giving the expected perfect score.
There are four tasks for each post:
- Label each of the words in the post text field with a language label, one of en (English), fr (French), it (Italian), es (Spanish), und (not a language) or oth (other language). The text MUST be split on white space in order to obtain exactly the same tokens that are present in the ground truth (see the sketch after this list). The format for the labels is the tokens array:

  "tokens": [ { "lang": "und", "word": "@user1" },... ],

  The label objects in the tokens array are in the same order as they appear in the text. For names of places/persons/organizations, label with a language if that would be considered the usual use of the name in that language. The following fragments should not be labelled: screen names, hashtags, urls, emoticons, numbers and any other text that is not a dictionary word or discussed in these instructions. Misspelled words, abbreviations, slang, filler expressions like "Ummmm" and acronyms (e.g. "lol") should be assumed to be in the language that they likely represent.
- Provide a True/False value for the field postFullSentence: whether the post contains at least one meaningful sentence or clause.
- For the field relevance, select the "relevance" category that best represents the text post within the context of the discussion, from the categories 5 (the post is clearly relevant to the discussion), 4 (the post is probably relevant to the discussion), 3 (it is unclear if the post is or is not relevant to the discussion), 2 (the post is probably not relevant to the discussion), 1 (the post is clearly not relevant to the discussion). A text post should be considered relevant to the discussion if its content relates to and/or provides meaning that contributes to the discussion. Examples of posts that are not relevant include spam, trolling, posts that have no discernible meaning, etc. You may provide any real number between 1 and 5 inclusive if you wish to be more precise.
- Provide a True/False value for each of the following fields:
  postProvidesAdditionalInfo: the post provides a response or additional information to the replied-to post
  postOpposes: the post opposes the replied-to post
  postSupports: the post supports the replied-to post
  postQuestions: the post questions the replied-to post
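As a concrete illustration of the tokenisation requirement, here is a minimal Python sketch that builds a consensus-shaped prediction for a single post. The whitespace split is the part mandated above; guess_language and all of the post-level values are hypothetical placeholders to be replaced by your own models:

def guess_language(word):
    # Placeholder: label everything "und". Replace with a real language
    # identifier over {en, fr, it, es, und, oth}; note that the example
    # above labels the screen name "@user1" as "und".
    return "und"

def predict_consensus(post_text):
    # Split on white space so the tokens line up with the ground truth.
    tokens = [{"lang": guess_language(word), "word": word}
              for word in post_text.split()]
    return {
        "tokens": tokens,
        "postFullSentence": True,               # placeholder predictions
        "relevance": 3,
        "postProvidesAdditionalInfo": False,
        "postOpposes": False,
        "postSupports": False,
        "postQuestions": False,
    }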
Evaluation
For your submission to be evaluated, you must email aharwood@unimelb.edu.au a JSON formatted file, identical in format to the TestData1.json file but with a predicted_ground_truth field for each discussion. We will evaluate your submission and update your score on the Leaderboard. Please be aware of the closing date for submissions.
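As a sketch of producing such a file, the Python below reads TestData1.json, attaches a predicted_ground_truth field to every discussion and writes the result as one JSON object per line. The output file name (MySubmission.json) and the choice to predict every post are our own assumptions; predict_consensus is the placeholder sketch from the Tasks section, and the predicted_ground_truth entries are assumed to mirror the ground_truth layout (post_id plus consensus):

import json

with open("TestData1.json", encoding="utf-8") as fin, \
        open("MySubmission.json", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        item = json.loads(line)
        posts = item["discussion"]["discussion"]
        item["predicted_ground_truth"] = [
            {"post_id": i + 1, "consensus": predict_consensus(post["text"])}
            for i, post in enumerate(posts)
        ]
        fout.write(json.dumps(item) + "\n")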