Event Extraction

This is the task of extracting events from unstructured text and capturing structured information about them, such as who did what, when, and where. Event extraction involves extracting entities and the relationships between those entities. It is also domain-specific, and it is considered the hardest of the three tasks: named entity recognition, relation extraction, and event extraction.
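To make "structured information" concrete, here is a minimal sketch of what one extracted event record might look like. The class names, the role labels, and the example sentence are assumptions for illustration only and are not taken from ACE or any other ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    role: str  # hypothetical role label, e.g. "Buyer", "Time", "Place"
    text: str  # the entity mention that fills the role


@dataclass
class Event:
    event_type: str  # hypothetical type label, e.g. "Acquisition"
    trigger: str     # the word in the text that signals the event
    arguments: List[Argument] = field(default_factory=list)

    def get(self, role: str) -> List[str]:
        """Return every mention that fills the given role."""
        return [a.text for a in self.arguments if a.role == role]


# Made-up sentence: "Acme Corp acquired Widget Ltd in London in 2019."
event = Event(
    event_type="Acquisition",
    trigger="acquired",
    arguments=[
        Argument("Buyer", "Acme Corp"),     # who
        Argument("Target", "Widget Ltd"),   # what
        Argument("Place", "London"),        # where
        Argument("Time", "2019"),           # when
    ],
)

print(event.get("Buyer"))
```

The arguments are entity mentions, which is why event extraction builds on entity and relation extraction.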

For geopolitical events, event extraction has been studied under the ACE ontology and alternatives such as CAMEO and ICEWS. For the biomedical domain, the BioNLP tasks are popular. Some work aims to reduce the complexity of the task, for example by training a sequence of classifiers that first extract the event triggers and then work out the event arguments. Joint extraction methods, which extract the event triggers and arguments together, are currently the state of the art (SOTA). This makes sense because events and entities are closely related: entities tend to be the arguments of events. The approach is to model dependencies between the variables of events, entities, and their relations, and to perform joint inference over these variables across documents. A limitation of joint information extraction is that it suffers from complexity issues.
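The staged approach (triggers first, then arguments) can be sketched as two functions chained together. This is only a toy under stated assumptions: the trigger lexicon and the position-based role heuristic are hypothetical stand-ins for trained classifiers, and the entity mentions are assumed to come from an earlier NER step.

```python
# Hypothetical trigger lexicon standing in for a trained trigger classifier.
TRIGGER_LEXICON = {"attacked": "Attack", "met": "Meet", "acquired": "Acquisition"}


def find_triggers(tokens):
    """Stage 1: return (token_index, event_type) for each trigger token."""
    return [
        (i, TRIGGER_LEXICON[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in TRIGGER_LEXICON
    ]


def find_arguments(trigger_index, entities):
    """Stage 2: assign a role to each entity relative to one trigger.

    Toy heuristic (an assumption, not a real model): entities before the
    trigger are labelled "Agent", entities after it "Target".
    """
    return [
        ("Agent" if position < trigger_index else "Target", text)
        for position, text in entities
    ]


def extract_events(tokens, entities):
    """Run stage 1, then stage 2 once per detected trigger."""
    return [
        {
            "type": event_type,
            "trigger": tokens[index],
            "arguments": find_arguments(index, entities),
        }
        for index, event_type in find_triggers(tokens)
    ]


tokens = "Rebels attacked the convoy".split()
entities = [(0, "Rebels"), (3, "convoy")]  # (token index, mention) from NER
events = extract_events(tokens, entities)
print(events)
```

In this pipeline an error in stage 1 carries straight through to stage 2. A joint model would instead score triggers and arguments together.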

In summary, the performance of event extraction is still relatively poor, and one direction worth exploring is to exploit document-level context.

Web Information Extraction

Given a domain, the first step is domain discovery: finding and crawling the relevant pages. To construct a domain-specific KG, we still need to perform NER, relation extraction, and event extraction on that information.

The dominant technique for web information extraction is known as the “wrapper”, which uses a single uniform query to access multiple sources of information. A wrapper applies a set of extraction rules to perform pattern matching. Because the approach is rule-based, designing and tailoring wrappers for a specific domain can be costly, so research has moved towards trainable wrapper systems.
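As a rough illustration, a hand-written wrapper can be as simple as a set of named patterns applied to a page. The field names, the HTML layout, and the regular expressions below are all made up for this sketch; a production wrapper would more likely use an HTML parser than regular expressions.

```python
import re

# Hypothetical extraction rules, hand-tailored to one site's page layout.
RULES = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "price": re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>'),
}


def wrap(html, rules=RULES):
    """Apply each extraction rule and return one structured record."""
    record = {}
    for field_name, pattern in rules.items():
        match = pattern.search(html)
        record[field_name] = match.group(1).strip() if match else None
    return record


page = '<html><body><h1>Blue Kettle</h1><span class="price">$24.99</span></body></html>'
record = wrap(page)
print(record)

# The same rules applied to a site with a different layout find nothing.
other_site = '<div class="name">Blue Kettle</div><div class="cost">24.99 USD</div>'
print(wrap(other_site))
```

The second call shows where the cost comes from: the rules are tied to one layout, so each new site or redesign needs new rules.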

Wrapper systems can be built using supervised, semi-supervised, or unsupervised techniques. A supervised approach generally takes a set of web pages paired with the data extracted from them and trains a wrapper on that dataset. A semi-supervised approach aims to reduce the cost of labelling a dataset by generating extraction rules from a set of examples supplied by users. An unsupervised approach uses neither a labelled dataset nor user interaction to generate a wrapper; it extracts targets by segmenting the page into its data-rich regions.
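For the supervised case, one simple way to "train" a wrapper is to learn the text that consistently surrounds the labelled targets and reuse it as delimiters. The sketch below is a toy version of that idea, not the method of any particular system. The function names and example pages are hypothetical, and it assumes each target string occurs exactly once in its page.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from labelled (page, target) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)  # assumes the target occurs once
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    # Keep only the context shared by all examples.
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None


# Labelled training set: pages mapped to the data to be extracted.
examples = [
    ("<h2>Staff</h2><p>Name: <b>Ann</b> (HR)</p>", "Ann"),
    ("<h2>Team</h2><p>Name: <b>Bob</b> (IT)</p>", "Bob"),
]
wrapper = induce_wrapper(examples)
print(wrapper)

unseen = "<h2>Crew</h2><p>Name: <b>Cara</b> (Ops)</p>"
print(apply_wrapper(wrapper, unseen))
```

The more the labelled pages differ, the shorter and more general the learned delimiters become. This is one reason labelled data is valuable here, and why it is costly to collect.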


