Google has created a dataset of phrases to train patent search models. Many patent owners describe the subject of their patents in non-standard language, such as calling a soccer ball a "spherical recreation device," which can produce widely varied and impractical search results. To help train search models, the dataset contains approximately 50,000 phrase-to-phrase pairs, each labeled with how the two phrases relate to one another, such as synonym, exact match, or unrelated.
Image credit: Flickr user Nick Normal