How is Diffbot KG built?

There are a few key machine learning components in the KG pipeline:

  1. Page type classification

  2. Visual extraction

  3. Natural language understanding

  4. Record linking

Once pages are crawled (using Selenium to get all the information on the page), we perform page type classification. This is where we classify what the page is about: products, events, articles, people, languages, and so on. Once the page is classified, we perform visual extraction to pull out the images and the attributes related to them. For example, if the page has been classified as a product page, we extract all the product images as well as the name of the product, the price, and other attributes related to that product. Next comes natural language understanding, where we perform named entity recognition (NER), relation extraction, coreference resolution, and entity linking. The last piece is record linking (data integration), where we fuse the data from different sources together to ensure facts are consistent and accurate.
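
To make the last step more concrete, here is a minimal sketch of record linking and fusion. It is a toy stand-in and not Diffbot's actual method: records are matched on a crudely normalized name, and conflicting values are resolved by majority vote. The function names (`normalize`, `link_and_fuse`) and the sample records are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def normalize(name):
    """Crude matching key: lowercase and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def link_and_fuse(records):
    """Group records that share a normalized name, then keep the
    most common value for each attribute within a group."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)

    fused = {}
    for key, recs in groups.items():
        # Collect every attribute seen in the group, except the name itself.
        attrs = {attr for rec in recs for attr in rec if attr != "name"}
        entity = {}
        for attr in sorted(attrs):
            votes = Counter(rec[attr] for rec in recs if attr in rec)
            entity[attr] = votes.most_common(1)[0][0]
        fused[key] = entity
    return fused


# Hypothetical records for the same company found on different pages.
records = [
    {"name": "Acme Corp.", "city": "Berlin", "founded": 1999},
    {"name": "ACME Corp", "city": "Berlin"},
    {"name": "acme corp", "city": "Munich", "founded": 1999},
    {"name": "Globex", "city": "Springfield"},
]

fused = link_and_fuse(records)
print(fused["acmecorp"])
```

A production system would replace the exact-match key with a learned similarity model and weight each source by its reliability, but the shape of the problem is the same: decide which records describe the same entity, then decide which facts to keep.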

How can we develop a highly reusable knowledge graph?

A common problem is that knowledge graphs are difficult to reuse without a schema and difficult to adapt outside of their original use case, which is why the schema matters so much. A good KG schema has the following properties:

  1. Can incorporate contextual information about events

  2. Easy to adapt and maintain

  3. Human-readable documentation

  4. Rigorous definitions

This leads us to modular ontologies. A modularly structured schema allows us to better adapt and evolve when receiving new data. The overall methodology is as follows:

  1. Define scope of use cases

  2. Come up with key questions while looking at different data sources and continue to scope the problem

  3. Identify key notions from data sources and use cases and identify the relevant pattern to use for each case

  4. Instantiate these key notions from pattern templates and adapt your results as needed

  5. Systematically add axioms for each module

  6. Assemble the modules and add axioms which involve several modules

  7. Check and improve all entity names. Check module axioms to see if they are still appropriate

  8. Create OWL files (schema)
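
The later steps (adding axioms per module, assembling the modules, and creating the OWL file) can be sketched as follows. This is only an illustration under stated assumptions: the module names, class names, properties, and the `http://example.org/kg#` namespace are all hypothetical, and the "axioms" are limited to subclass, domain, and range statements written out as OWL in Turtle syntax.

```python
PREFIXES = """@prefix : <http://example.org/kg#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
"""

# Each module is a small, self-contained part of the schema.
# "classes" maps a class to its parent (or None); "properties" maps
# a property to its (domain, range). All names are hypothetical.
MODULES = {
    "Agent": {
        "classes": {"Agent": None, "Person": "Agent", "Organization": "Agent"},
        "properties": {},
    },
    "Event": {
        "classes": {"Event": None, "Place": None},
        "properties": {"occursAt": ("Event", "Place")},
    },
}

# Axioms that involve several modules are added at assembly time.
CROSS_MODULE_PROPERTIES = {"hasParticipant": ("Event", "Agent")}


def class_to_turtle(name, parent):
    lines = [f":{name} a owl:Class"]
    if parent:
        lines.append(f"    rdfs:subClassOf :{parent}")
    return " ;\n".join(lines) + " ."


def property_to_turtle(name, domain, range_):
    return (f":{name} a owl:ObjectProperty ;\n"
            f"    rdfs:domain :{domain} ;\n"
            f"    rdfs:range :{range_} .")


def assemble(modules, cross_props):
    """Concatenate all modules, then append the cross-module axioms."""
    known = {cls for mod in modules.values() for cls in mod["classes"]}
    blocks = [PREFIXES]
    for mod_name, mod in modules.items():
        blocks.append(f"# Module: {mod_name}")
        for cls, parent in mod["classes"].items():
            blocks.append(class_to_turtle(cls, parent))
        for prop, (dom, rng) in mod["properties"].items():
            blocks.append(property_to_turtle(prop, dom, rng))
    blocks.append("# Cross-module axioms")
    for prop, (dom, rng) in cross_props.items():
        # A cross-module axiom may only refer to classes some module defines.
        if dom not in known or rng not in known:
            raise ValueError(f"{prop} refers to an undefined class")
        blocks.append(property_to_turtle(prop, dom, rng))
    return "\n\n".join(blocks) + "\n"


schema = assemble(MODULES, CROSS_MODULE_PROPERTIES)
print(schema)
```

The resulting string can be saved to a `.ttl` file and opened in an ontology editor. Keeping each module in its own entry means a new data source usually touches one module plus, at most, a few cross-module axioms, which is the point of the modular approach.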



