The dataset is called MS MARCO, which stands for Microsoft MAchine Reading COmprehension, and the team behind it says it’s the most useful dataset of its kind because it is based on anonymized real-world data.
By making it broadly available for free to researchers, the team is hoping to spur the kind of breakthroughs in machine reading that are already happening in image and speech recognition. They also hope to facilitate the kind of advances that could lead to the long-term goal of ‘artificial general intelligence,’ or machines that can think like humans.
“In order to move towards artificial general intelligence, we need to take a step towards being able to read a document and understand it as well as a person,” said Rangan Majumder, a partner group program manager with Microsoft’s Bing search engine division who is leading the effort.
Majumder believes that systems to answer sophisticated questions are still in their infancy. Search engines like Bing and virtual assistants like Cortana can answer basic questions, like “What day does Hanukkah start?” or “What’s 2,000 times 43?”
But in many cases, Majumder said, search engines and virtual assistants will instead point the user to a list of search results. Users can still get the information they need, but only by sifting through the results and finding the answer on a web page themselves.
In order to make automated question-and-answer systems better, researchers need a strong source of what is called training data. These datasets can be used to teach artificial intelligence systems to recognize questions and formulate answers and, eventually, to create systems that can come up with their own answers based on unique questions they haven’t seen before.
Majumder and his team – which includes Microsoft researchers and people working on Microsoft products – say the MS MARCO dataset is particularly useful because the questions are based on real, anonymized queries from Microsoft’s Bing search engine and Cortana virtual assistant. The team selected the anonymized questions they thought would be most interesting to researchers. In addition, the answers were written by humans, based on real web pages, and verified for accuracy.
By providing realistic questions and answers, the researchers say they can train systems to better deal with the nuances and complexities of questions regular people actually ask – including those queries that have no clear answer or multiple possible answers.
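To make the dataset's composition concrete, here is a minimal sketch of what a single MS MARCO-style record might look like and how a researcher could pull out the human-written answer and its supporting passage. The field names and values below are illustrative assumptions, not the official schema:

```python
import json

# Hypothetical record in the style of MS MARCO's released JSON:
# a real anonymized query, candidate web passages, and a
# human-written answer. Field names are assumptions for illustration.
record = {
    "query_id": 19699,
    "query": "what day does hanukkah start",
    "passages": [
        {"is_selected": 0,  # passage retrieved but not used for the answer
         "url": "https://example.com/a",
         "passage_text": "Hanukkah is an eight-day Jewish festival..."},
        {"is_selected": 1,  # passage a human judge used to write the answer
         "url": "https://example.com/b",
         "passage_text": "In 2016, Hanukkah begins on the evening of December 24."},
    ],
    "answers": ["Hanukkah begins on the evening of December 24, 2016."],
}

def supporting_passages(rec):
    """Return the passage texts marked as supporting the human answer."""
    return [p["passage_text"] for p in rec["passages"] if p["is_selected"]]

print(json.dumps(record["answers"]))
print(supporting_passages(record))
```

A training pipeline would iterate over many such records, pairing each query with its passages as model input and the human answer as the target; queries with an empty `answers` list would teach the model when no clear answer exists.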
Li Deng, partner research manager of Microsoft’s Deep Learning Technology Center, said previous datasets were designed with certain limitations, or constraints. That made it easier for researchers to create solutions that could be formulated as what machine learning researchers call “classification problems,” rather than by seeking to understand the actual text of the question.
He also said that MS MARCO is built so that researchers can experiment with more advanced deep learning models intended to push artificial intelligence research further forward.
“Our dataset is designed not only using real-world data but also removing such constraints so that the new-generation deep learning models can understand the data first before they answer questions,” Deng added.
Majumder said the ability of systems to answer complex questions could augment human abilities by helping people get information more efficiently.