Information extraction is a core component of any knowledge graph pipeline. Its goal is to extract information from raw data. It remains an active research area because NLP techniques still struggle to understand natural language, which is subtle, ambiguous, and irregular. The early goal of information extraction was to extract key information such as entities, relations, events, and attributes.
Generally, an information extraction system is constrained by an underlying ontology, although in recent years open information extraction has gained popularity. Many techniques are being researched and explored today, including rule-based methods, conditional random fields (CRFs), and deep neural networks.
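To make the idea of ontology-constrained, rule-based extraction concrete, here is a minimal sketch using only Python's standard library. The relation names, the regular expressions, and the helper extract_triples are all hypothetical and chosen for illustration; a real system would use far richer patterns and a proper ontology.

```python
import re

# Hypothetical mini-ontology: the only relation types the extractor may emit.
ONTOLOGY_RELATIONS = {"acquired", "founded"}

# A naive "name" pattern: one or more capitalised words (an assumption that
# only holds for simple English text).
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# Hand-written surface patterns, one per candidate relation.
PATTERNS = {
    "acquired": re.compile(rf"({NAME}) acquired ({NAME})"),
    "founded": re.compile(rf"({NAME}) founded ({NAME})"),
    "praised": re.compile(rf"({NAME}) praised ({NAME})"),  # not in the ontology
}


def extract_triples(text):
    """Return (subject, relation, object) triples allowed by the ontology."""
    triples = []
    for relation, pattern in PATTERNS.items():
        if relation not in ONTOLOGY_RELATIONS:
            continue  # the ontology constrains what we are allowed to extract
        for subject, obj in pattern.findall(text):
            triples.append((subject, relation, obj))
    return triples


text = (
    "Acme Corp acquired Beta Labs. Jane Doe founded Acme Corp. "
    "Jane Doe praised Beta Labs."
)
print(extract_triples(text))
```

The "praised" pattern matches the text, but the triple is dropped because the relation is not part of the ontology. An open information extraction system would, by contrast, not be limited to a fixed relation set.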
The challenges of information extraction
Lack of labelled data – State-of-the-art (SOTA) systems tend to be supervised models, and labelling data tends to be expensive and time-consuming
Many techniques are effective only in a “generic” setting – To apply information extraction techniques to a specific domain, you need to custom-train individual components of the information extraction pipeline. For example, you would need to custom-train an NER model to pick up financial entities
The format and heterogeneity of the raw data can be a challenge when sharing results across research groups – Where are we extracting our information from? Depending on the source of the raw data, the information has to be processed in different ways
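The last challenge, heterogeneous raw data, is often handled by normalising every source into plain text before any extraction step runs. The sketch below shows one possible way to do this with a small loader registry. The function names, the LOADERS table, and the "body" field are assumptions made for this example, not part of any particular library.

```python
import csv
import io
import json


def text_from_plain(raw):
    # One document per non-empty line.
    return [line.strip() for line in raw.splitlines() if line.strip()]


def text_from_json(raw, field="body"):
    # Assumes a JSON array of objects that each may carry a text field.
    return [record[field] for record in json.loads(raw) if record.get(field)]


def text_from_csv(raw, column="body"):
    # Assumes a header row that names the text column.
    reader = csv.DictReader(io.StringIO(raw))
    return [row[column] for row in reader if row.get(column)]


# Hypothetical registry mapping a source format to its loader.
LOADERS = {"txt": text_from_plain, "json": text_from_json, "csv": text_from_csv}


def load_documents(raw, source_format):
    """Normalise raw input of a known format into a list of text documents."""
    try:
        loader = LOADERS[source_format]
    except KeyError:
        raise ValueError(f"no loader registered for format: {source_format!r}")
    return loader(raw)


print(load_documents('[{"body": "Acme Corp reported earnings."}]', "json"))
print(load_documents("id,body\n1,Hello world\n", "csv"))
```

With a registry like this, supporting a new source only requires adding a loader, while the downstream extraction components keep receiving the same plain-text input.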
The scope of information extraction tasks
Because information extraction is such a broad area, we must define the scope of the problem we are trying to solve. Web information extraction and NLP-centric information extraction differ considerably in how they extract entities, relations, and events. Information extraction tasks include:
Named Entity Recognition
Web Information Extraction
For example, to build a knowledge graph from blog articles, one might build a Web information extraction system to scrape the articles and obtain their text. Only once we have the text data can we proceed to the other information extraction tasks.
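As a rough illustration of that first step, the sketch below pulls the visible text out of an HTML page using Python's built-in html.parser module. The class name ArticleTextExtractor, the helper html_to_text, and the sample page are made up for this example, and fetching the page over the network is deliberately left out.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects visible text and ignores script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page standing in for a scraped blog article.
sample_html = (
    "<html><head><style>p {color: red;}</style></head><body>"
    "<h1>My post</h1>"
    "<p>Acme Corp acquired <b>Beta Labs</b> last year.</p>"
    "<script>track();</script>"
    "</body></html>"
)
print(html_to_text(sample_html))
```

The resulting string is what the later, NLP-centric steps such as named entity recognition would take as input. Real blog pages usually also need navigation, comments, and other page furniture removed, which this sketch does not attempt.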