Relation extraction is “annotation hungry”, since each relation type needs its own labelled data. A potential fix is to use an existing knowledge base such as DBpedia. However, this only provides distant supervision: the classifier is trained on a weakly labelled dataset, where the labels are assigned automatically by heuristic rules rather than by human annotators.
Treat entity pairs as instances
One solution is to treat entity pairs as instances: features are aggregated across all the sentences in which both entities are mentioned, and the relation label for each pair is taken from the knowledge base. Negative instances are constructed from entity pairs that do not appear in the knowledge base. This labelled dataset is illustrated below:
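A minimal sketch of this construction, using a toy knowledge base and corpus (the entities, relation names, and sentences are hypothetical; a real pipeline would draw them from a KB such as DBpedia and an entity-tagged corpus):

```python
from collections import defaultdict

# Toy knowledge base: (entity_a, entity_b) -> relation label
kb = {("Paris", "France"): "capital_of",
      ("Berlin", "Germany"): "capital_of"}

# Corpus sentences paired with the entity pair they mention
corpus = [
    ("Paris is the capital of France.", ("Paris", "France")),
    ("Paris, France hosted the summit.", ("Paris", "France")),
    ("Berlin lies in Germany.", ("Berlin", "Germany")),
    ("Madrid beat Milan 2-0.", ("Madrid", "Milan")),
]

# Aggregate all sentences mentioning each entity pair into one instance
instances = defaultdict(list)
for sentence, pair in corpus:
    instances[pair].append(sentence)

# Label each entity-pair instance from the KB; pairs absent from the
# KB become negative ("no_relation") instances
dataset = [(pair, sents, kb.get(pair, "no_relation"))
           for pair, sents in instances.items()]
```

Note that the instance here is the entity pair, not any single sentence: the (Paris, France) instance bundles both of its sentences together under one label.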
Multiple instance learning
In multiple instance learning, labels are assigned to sets of instances, of which only an unknown subset are actually relevant. In the distant supervision framework, the relation REL(A, B) acts as the label for the entire set of sentences containing both entities A and B, even though a subset of these sentences may not describe the relation.
One approach to multiple instance learning is to introduce a binary latent variable for each sentence, indicating whether that sentence expresses the labelled relation. This casts relation extraction as a probabilistic model, so a variety of inference techniques can be applied: for example, expectation maximisation, sampling, or custom graph-based algorithms.
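A hard-EM sketch of the latent-variable idea, on toy data (the bags, the unigram log-odds scorer, and the "keep only the best-scoring sentence per positive bag" rule are all simplifying assumptions; real systems use proper sentence classifiers and soft posteriors):

```python
import math
from collections import Counter

# Bags: each entity pair contributes a bag of sentences and one bag-level label
bags = [
    (["paris is the capital of france",
      "paris france hosted the summit"], 1),   # positive bag
    (["berlin is the capital of germany",
      "berlin has great museums"], 1),         # positive bag
    (["madrid beat milan in the final"], 0),   # negative bag
]

def score(sentence, weights):
    """Score a sentence by summing its unigram log-odds weights."""
    return sum(weights.get(w, 0.0) for w in sentence.split())

def train(bags, iters=5):
    # Initialise latent z: every sentence inherits its bag's label
    z = [[label] * len(sents) for sents, label in bags]
    weights = {}
    for _ in range(iters):
        # M-step: refit unigram weights from sentences with z=1 vs z=0
        pos, neg = Counter(), Counter()
        for (sents, _), zs in zip(bags, z):
            for s, zi in zip(sents, zs):
                (pos if zi else neg).update(s.split())
        vocab = set(pos) | set(neg)
        weights = {w: math.log((pos[w] + 1) / (neg[w] + 1)) for w in vocab}
        # E-step: in each positive bag, keep z=1 only for the best-scoring
        # sentence ("at least one sentence expresses the relation")
        for i, (sents, label) in enumerate(bags):
            if label == 1:
                best = max(range(len(sents)),
                           key=lambda j: score(sents[j], weights))
                z[i] = [1 if j == best else 0 for j in range(len(sents))]
    return weights, z

weights, z = train(bags)
```

After training, `z` identifies which sentence in each positive bag is treated as actually expressing the relation, while off-topic sentences ("berlin has great museums") are pushed to the negative side.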