Innovating innovation with the world’s leading minds
Fujitsu AI-NLP Challenge

The challenge is now closed.

Background

Fujitsu Ltd., a leading information and communications technology company, develops and provides AI-related services and products. Fujitsu’s AI technologies are branded as FUJITSU Human Centric AI Zinrai and provide a diverse set of AI functions.

One of the Zinrai platform service APIs provides FAQ search. An important application of Zinrai FAQ search is improving response times and accuracy in call centers.

Fujitsu is calling for entries to strengthen its FAQ search: we challenge you to develop novel, strong natural language processing technologies that complement the original technology in Zinrai FAQ search.

Problem Setting

Question answering (QA) is a crucial task in natural language processing that requires both natural language understanding and world knowledge. Previous QA datasets tend to be high in quality due to human annotation, but small in size [1, 2]. Hence, they do not allow for training data-intensive, expressive models such as deep neural networks. Moreover, most of these datasets are constrained in both the number of examples and the scope of topics.

The SelQA [3] dataset presents a corpus of annotated question answering examples on various topics drawn from Wikipedia. Broader context extraction, together with an effective annotation scheme, results in a large corpus that is both challenging and realistic.

Your goal is to develop a QA algorithm based on the selection-based question answering dataset, SelQA [3]. The dataset introduces a corpus annotation scheme that enhances the generation of large, diverse, and challenging datasets by explicitly aiming to reduce word co-occurrences between the question and answers. The accompanying paper [3] compares several systems on the tasks related to answer sentence selection (ASS) and answer triggering (AT), providing strong baseline results for this challenge.

As part of the challenge you are to solve only the Answer Sentence Selection problem: for each question, calculate a probability of correctness in [0, 1] for every sentence in the accompanying answer set. For example, if a particular question comes with 3 candidate sentences and only the 2nd one answers it, the optimal result would be [0, 1, 0]; however, any result that prefers the 2nd sentence over the rest, for instance [0.1, 0.8, 0.2], will also be rewarded by the MRR score (see the evaluation process).

Data Sets

The SelQA dataset consists of questions generated through crowdsourcing and sentence-length answers drawn from the ten most prevalent topics in the English Wikipedia [3]. A total of 486 articles were uniformly sampled from the following 10 topics of the English Wikipedia dump of August 2014: Arts, Country, Food, Historical Events, Movies, Music, Science, Sports, Travel, TV.

The original data was preprocessed into smaller chunks using the section boundaries provided in the original dump, and segmented into sentences with the open-source toolkit NLP4J, resulting in 8,481 sections, 113,709 sentences, and 2,810,228 tokens.

For each section, a human annotator generated a question that can be answered by one or more sentences in that same section, and selected the corresponding sentence or sentences that answer it. As an additional noise process, annotators were also asked to create another set of questions from the same sections, excluding the sentences already selected as answers in the previous task. All questions were then paraphrased using different terms, to ensure that a QA algorithm is evaluated on reading comprehension rather than on its ability to model word co-occurrences. Lastly, any questions found to be ambiguous were rephrased again by a human annotator.

Example process:

  1. The premiere episode was met with mixed reviews, receiving a score of 42 out of 100 on aggregate review site Metacritic, indicating “mixed or average” reviews.

  2. Dorothy Rabinowitz said, in her review for the Wall Street Journal, that “From the evidence of the first few episodes, Criminal Minds may be a hit, and deservedly”...

  3. The New York Times was less than positive, saying “The problem with Criminal Minds is its many confusing maladies, applied to too many characters” and felt that “as a result, the cast seems like a spilled trunk of broken toys, with which the audience - and perhaps the creators - may quickly become bored.”

  4. The Chicago Tribune reviewer, Sid Smith, felt that the show “May well be worth a look” though he too criticized “the confusing plots and characters”.

For the initial task, the following question was compiled for the answer found in the 1st sentence: “How was the premiere reviewed?”
Then an additional question was added based on the 2nd sentence: “Who felt that Criminal Minds had confusing characters?”
After that, the questions were paraphrased into “How were the initial reviews?” and “Who was confused by characters on Criminal Minds?”
And lastly, since the 1st paraphrase was ambiguous, it was rephrased to “How were the initial reviews in Criminal Minds?”; the 2nd one remained unchanged.

The dataset can be found here.

For training (train), validation (dev) and evaluation (test), use the data files named SelQA-ass-train.json, SelQA-ass-dev.json and SelQA-ass-test.json respectively.

Each JSON data file is a list of dictionaries, where each dictionary is a single question entity. In the dictionary, the following elements can be found:

question - the question text
sentences - the list of sentences of the section this question refers to
candidates - zero-based indexes into the sentences attribute identifying the sentences that answer the question
is_paraphrase - whether the question is a paraphrase of another question
type - genre of the question (one of: MUSIC, TV, TRAVEL, ART, SPORT, COUNTRY, MOVIES, HISTORICAL EVENTS, SCIENCE, FOOD)
article - name of the Wikipedia article the question is about
section - name of the Wikipedia section the question is about
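
For illustration, here is a minimal Python 3 sketch of loading a data file and inspecting one question entity; the file name matches the training split listed above, and the snippet assumes nothing beyond the fields described here:

import json

# Load the training split; adjust the path to wherever the dataset was saved.
with open("SelQA-ass-train.json", encoding="utf-8") as f:
    entries = json.load(f)  # a list of question dictionaries

entry = entries[0]
print(entry["question"])        # the question text
print(len(entry["sentences"]))  # number of candidate sentences in the section
print(entry["candidates"])      # zero-based indexes of the answer sentences
print(entry["type"], entry["article"], entry["section"], entry["is_paraphrase"])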

Examples and clarifications:

Several sentences may contain an answer to the question, as in the following example:

{'section': 'Nominations and awards',
'question': 'Which Academy Award did Kevin Spacey win for his work on The Usual Suspects?',
'candidates': [0, 1],
'sentences': [
'Christopher McQuarrie was nominated for the Best Original Screenplay and Kevin Spacey was nominated for Best Supporting Actor at the Academy Awards.', 'They both won, and in his acceptance speech Spacey memorably said, "Well, whoever Keyser Söze is, I can tell you he\'s gonna get gloriously drunk tonight.',
'"McQuarrie also won the Best Original Screenplay award at the 1996 British Academy Film Awards.', 'The film was also nominated for Best Film, and best editing.', 'It won for best editing.', 'The film was nominated for three Independent Spirit Awards — Best Supporting Actor for Benicio del Toro, Best Screenplay for Christopher McQuarrie and Best Cinematography for Newton Thomas Sigel.', 'Both Del Toro and McQuarrie won in their categories.', '"The Usual Suspects" was screened at the 1995 Seattle International Film Festival, where Bryan Singer was awarded Best Director and Kevin Spacey won for Best Actor.', 'The Boston Society of Film Critics gave Spacey the Best Supporting Actor award for his work on the film.', 'Spacey went on to win this award with the New York Film Critics Circle and the National Board of Review, which also gave the cast an ensemble acting award.'],
'article': 'The Usual Suspects',
'type': 'MOVIES',
'is_paraphrase': False}

This means that both sentences 0 and 1 provide an answer to the question.

Evaluation

Due to the increasing complexity of question answering, deep learning has become a popular approach to such problems. Two recent state-of-the-art systems, based on convolutional and recurrent neural networks, were implemented to analyze this corpus and provide strong baseline measures [3]. The details of the implementations were largely based on [4] and [5].

Mean reciprocal rank (MRR) was chosen as the evaluation metric; it is the average of the reciprocal ranks of the results over a sample of queries Q:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}

where rank_i refers to the rank position of the first relevant document for the i-th query.

The measure evaluates any process that produces, for each query in the sample, a list of possible responses ordered by probability of correctness.
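
For concreteness, here is a minimal Python 3 sketch of how MRR can be computed from the candidates and results fields of each entry; the provided evaluation3.py remains the authoritative scorer, and the function name below is illustrative only:

def mean_reciprocal_rank(entries):
    # entries: list of question dictionaries carrying 'candidates' and 'results'.
    total = 0.0
    for entry in entries:
        # Order sentence indexes by descending predicted probability of correctness.
        order = sorted(range(len(entry["results"])),
                       key=lambda i: entry["results"][i], reverse=True)
        # Reciprocal rank of the highest-ranked correct sentence.
        rank = next(pos + 1 for pos, idx in enumerate(order)
                    if idx in entry["candidates"])
        total += 1.0 / rank
    return total / len(entries)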

Deliverables:

Your submission must include:

  1. Your Python program
  2. An MS Word (template here) or LaTeX (template here) document describing your algorithm
  3. The output file SelQA-ass-result.json resulting from your program processing the data file SelQA-ass-test.json
  4. Your MRR score (mrr.txt as generated by: python evaluation3.py SelQA-ass-result.json > mrr.txt)

To ensure the validation process is uniform and fair, you are provided with an evaluation program that processes a result file in the following format:

The result of your algorithm should be a JSON file in the same format as the input, with an added attribute "results" containing a list of probabilities of correctness, one per sentence.

For example, for the entity shown previously, the results attribute would be added as follows:

{'section': 'Nominations and awards',
'question': 'Which Academy Award did Kevin Spacey win for his work on The Usual Suspects?',
'candidates': [0, 1],
'sentences': [...],
'article': 'The Usual Suspects',
'type': 'MOVIES',
'is_paraphrase': False,
'results': [0.9, 0.8, 0.1, 0.3, 0.3, 0, 0.5, 0, 0, 0.1]}
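
As an illustration of the required mechanics only (not the baseline from [3]), the following Python 3 sketch scores every sentence by simple word overlap with the question and writes SelQA-ass-result.json in the format above; the overlap heuristic is a placeholder to be replaced by your own algorithm:

import json

with open("SelQA-ass-test.json", encoding="utf-8") as f:
    data = json.load(f)

for entry in data:
    question_words = set(entry["question"].lower().split())
    # Placeholder scoring: fraction of question words appearing in each sentence.
    entry["results"] = [
        len(question_words & set(sentence.lower().split())) / (len(question_words) or 1)
        for sentence in entry["sentences"]
    ]

with open("SelQA-ass-result.json", "w", encoding="utf-8") as f:
    json.dump(data, f)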

Clarifications:

As the evaluation process is also based on the dataset, submissions that try to memorize the dataset will be disqualified.

Along with your submission, please provide a document in MS Word (template here) or LaTeX (template here) that describes your methodology in detail, the tools used, and any initial setup of required libraries.

The submission should be implemented in Python (either 2.7 or 3). You may rely on any third-party library, as long as it is open source.

Although the authors of the paper used a Java NLP library, you are to implement everything in Python, including all pre-processing and data manipulation procedures.

A single submission should be provided per team.

You are not permitted to share information with other teams.
Each team may submit as many times as they wish.

You are to solve only the Answer Sentence Selection problem.

The paper [3] provides evaluation MRR scores for 10 different algorithms, ranging from 83.18 to 87.59.

Your submission script must read the file named "SelQA-ass-test.json" and produce "SelQA-ass-result.json" in the same format as the input, with the added attribute "results" containing a list of probabilities of correctness.

Your submission script must be written in Python (a .py file).

For instance, for an input list containing a single question:

{'section': 'Season 4',
'question': "What happened first, Ben and Leslie ending their relationship or Ben becoming Leslie's campaign manager?",
'candidates': [0, 6],
'sentences': ["With Ben's encouragement, Leslie decides to run for city council, and the two end their relationship.", 'Leslie hires Andy as her assistant.', 'Patricia Clarkson appears as Ron\'s first ex-wife, "Tammy One", who uses her power as an IRS employee to trick Ron into thinking he\'s being audited and temporarily takes complete control over his life.', "Tom and Jean-Ralphio's company, Entertainment 720, quickly blows through massive amounts of promotional funding while performing little actual work; the company goes out of business and Tom returns to his old job.", 'After struggling to move on both personally and professionally, Ben and Leslie get back together, and Ben sacrifices his job to save Leslie from losing hers.', "The scandal leads her political advisors to abandon Leslie's campaign, and the Parks Department volunteers to become her new campaign staff.", "Ben agrees to be Leslie's campaign manager.", "Leslie's ex-boyfriend Dave (Louis C.K.)", 'reappears and unsuccessfully attempts to win Leslie back.', "Leslie's campaign faces myriad setbacks against her main opponent, Bobby Newport (Paul Rudd), and his famous campaign manager Jennifer Barkley (Kathryn Hahn).", 'Ann and Tom begin an extremely rocky romantic relationship.', "April takes on more responsibility in the department, shouldering much of Leslie's usual work.", 'In the season finale, Jennifer offers Ben a job in Washington, which he reluctantly accepts, and after the race is initially called for Newport, Leslie wins the position in a recount.'],
'article': 'Parks and Recreation',
'type': 'TV',
'is_paraphrase': False}

The result will be:

{'section': 'Season 4',
'question': "What happened first, Ben and Leslie ending their relationship or Ben becoming Leslie's campaign manager?",
'candidates': [0, 6],
'sentences': ["With Ben's encouragement, Leslie decides to run for city council, and the two end their relationship.", 'Leslie hires Andy as her assistant.', 'Patricia Clarkson appears as Ron\'s first ex-wife, "Tammy One", who uses her power as an IRS employee to trick Ron into thinking he\'s being audited and temporarily takes complete control over his life.', "Tom and Jean-Ralphio's company, Entertainment 720, quickly blows through massive amounts of promotional funding while performing little actual work; the company goes out of business and Tom returns to his old job.", 'After struggling to move on both personally and professionally, Ben and Leslie get back together, and Ben sacrifices his job to save Leslie from losing hers.', "The scandal leads her political advisors to abandon Leslie's campaign, and the Parks Department volunteers to become her new campaign staff.", "Ben agrees to be Leslie's campaign manager.", "Leslie's ex-boyfriend Dave (Louis C.K.)", 'reappears and unsuccessfully attempts to win Leslie back.', "Leslie's campaign faces myriad setbacks against her main opponent, Bobby Newport (Paul Rudd), and his famous campaign manager Jennifer Barkley (Kathryn Hahn).", 'Ann and Tom begin an extremely rocky romantic relationship.', "April takes on more responsibility in the department, shouldering much of Leslie's usual work.", 'In the season finale, Jennifer offers Ben a job in Washington, which he reluctantly accepts, and after the race is initially called for Newport, Leslie wins the position in a recount.'],
'article': 'Parks and Recreation',
'type': 'TV',
'is_paraphrase': False,
"results": [1, 0.2, 0.1, 0.3, 0.3, 0, 0.95, 0, 0, 0.1, 0, 0, 0]}

The validation program will then read the result file "SelQA-ass-result.json" and calculate the overall MRR score from the candidates and results fields of each entry.

Good luck!

References

[1] J. Berant, V. Srikumar, P. Chen, A. Linden, B. Harding, B. Huang, P. Clark, and C. Manning, “Modeling Biological Processes for Reading Comprehension”, In EMNLP, 2014.

[2] M. Richardson, C. Burges, and E. Renshaw, “MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text”, In EMNLP, volume 3, pp. 4, 2013.

[3] T. Jurczyk, M. Zhai, and J. Choi, “SelQA: A New Benchmark for Selection-Based Question Answering”, In ICTAI, 2016.

[4] Y. Yang, W.-t. Yih, and C. Meek, “WikiQA: A Challenge Dataset for Open-Domain Question Answering”, In EMNLP, 2015, pp. 2013–2018.

[5] C. N. dos Santos, M. Tan, B. Xiang, and B. Zhou, “Attentive Pooling Networks”, CoRR, vol. abs/1602.03609, 2016.