Crowdsourced question-and-answer website Quora has published a dataset of different questions users ask on its website that could be similar to each other to spur the development of algorithms that can help curate redundant questions. For example, the questions “Should I learn Python or Java first?” and “If I had to choose between learning Java and Python, what should I choose to learn first?” are semantically equivalent—though they use different words and structures, they are asking the same thing—but identifying this can be challenging to automate, causing users to frequently post the same questions on the website without realizing there is already an answer available. The dataset contains 400,000 lines of these potentially similar questions and indicates whether or not they are semantically similar, so developers could build natural language processing systems that could help Quora redirect users to posts where their questions have already been answered.
Learning to Understand What People Ask on the Internet
previous post