Objective and Contribution

We extended the Dr. Inventor corpus with annotations of argumentative components and relations and conducted an annotation study. The goal is to understand the different arguments within scientific text and how they are linked together. We then analysed the annotated argumentation and explored the relations between arguments that exist within scientific writing. The contributions are as follows:

  1. Proposed a general argumentative annotation scheme for scientific text that covers different research domains

  2. Extended the Dr. Inventor corpus with annotations of argumentative components and relations

  3. Conducted an information-theoretic analysis of the corpus

Annotation Scheme

There are many theoretical frameworks for argumentation, and we initially used the Toulmin model for its simplicity and its relevance to AI and argument mining. The Toulmin model has six types of argumentative components: claim, data, warrant, backing, qualifier, and rebuttal. However, after initial annotations, we realised that not all of these components appear in scientific text. Therefore we simplified our annotation scheme to the following three argumentative components:

  1. Own Claim. An argumentative statement that relates to the author’s own work

  2. Background Claim. An argumentative statement that relates to the background of the author’s work, e.g. related work

  3. Data Component. A fact that serves as evidence for or against a claim, including references and examples

With those argumentative components defined, we introduced the following three relation types (a small representation sketch follows this list):

  1. Supports. This relation holds between two components if the factual accuracy of one component increases with the other

  2. Contradicts. This relation holds between two components if the factual accuracy of one component decreases with the other

  3. Semantically Same. This relation captures claims or data components that are semantically the same; it is similar to argument and/or event coreference
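To make the scheme concrete, below is a minimal Python sketch of how components and relations could be represented. The names and structure are purely illustrative, not the corpus' actual annotation format.

```python
from dataclasses import dataclass

# Illustrative in-memory representation; names are hypothetical,
# not the corpus' actual annotation format.
COMPONENT_TYPES = {"own_claim", "background_claim", "data"}
RELATION_TYPES = {"supports", "contradicts", "semantically_same"}

@dataclass
class Component:
    id: str
    type: str    # one of COMPONENT_TYPES
    start: int   # character offset where the span starts
    end: int     # character offset where the span ends
    text: str

@dataclass
class Relation:
    source: str  # id of the source component (e.g. the supporting data)
    target: str  # id of the target component (e.g. the supported claim)
    type: str    # one of RELATION_TYPES

# Toy example: a data component supporting an own claim.
c1 = Component("T1", "own_claim", 0, 35, "Our method reduces simulation time.")
c2 = Component("T2", "data", 36, 75, "Runtime drops from 12s to 3s per frame.")
r1 = Relation(source="T2", target="T1", type="supports")
```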

Annotation Study

We performed an annotation study of the Dr. Inventor corpus and extended the dataset. The Dr. Inventor corpus has four layers of rhetorical annotations with sub-labels as shown below:

  1. Discourse Role

  2. Citation Purpose

  3. Subjective Aspects

  4. Summarisation Relevance

The annotation process involved one expert and three non-expert annotators. The annotators were trained in a calibration phase in which all annotators annotated one publication together. We computed the inter-annotator agreement (IAA) for each iteration and discussed any disagreements. The figure below shows the IAA score progression across five iterations. There are two versions of the score: strict and weak. The strict version requires components to match exactly in span and type, and relations to match exactly in both components, direction, and relation type. The weak version requires a match in type and only an overlap in span. As expected, agreement increases over the iterations. In addition, agreement on relations is lower, as relations are usually much more subjective, and agreement on relations is also bounded by agreement on the underlying components.
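The post does not give the exact matching procedure, but a minimal sketch of the strict versus weak criteria for components (reusing the illustrative Component objects from the earlier sketch, with an F1-style pairwise agreement) might look like this:

```python
def spans_overlap(a, b):
    """True if two (start, end) character spans overlap at all."""
    return a[0] < b[1] and b[0] < a[1]

def components_match(c1, c2, strict=True):
    """Strict: same type and identical span. Weak: same type and overlapping span."""
    if c1.type != c2.type:
        return False
    if strict:
        return (c1.start, c1.end) == (c2.start, c2.end)
    return spans_overlap((c1.start, c1.end), (c2.start, c2.end))

def agreement_f1(ann_a, ann_b, strict=True):
    """F1-style agreement between two annotators' component lists."""
    if not ann_a or not ann_b:
        return 0.0
    precision = sum(
        any(components_match(x, y, strict) for y in ann_b) for x in ann_a
    ) / len(ann_a)
    recall = sum(
        any(components_match(y, x, strict) for x in ann_a) for y in ann_b
    ) / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```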

Corpus Analysis

Argumentation annotations analysis

Table 2 shows summary statistics for each argumentative component and relation type in the extended Dr. Inventor corpus. There are roughly twice as many own claims as background claims, which is expected since the corpus consists of original research papers. In addition, there are only about half as many data components as claims. This could be because not all claims are supported, or because claims can be supported by other claims. Naturally, there are many supports relations, as authors tend to strengthen their claims by backing them with data components or other claims. Table 3 shows the length of the argumentative components. Own and background claims are of similar length, whereas data components are about half as long. This could be attributed to the fact that explanations in computer science tend to be short, and authors often simply refer to tables and figures for support.
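As a rough illustration, counts and average lengths like those in Tables 2 and 3 can be computed from the annotations with a few lines of Python (again assuming the illustrative Component/Relation objects sketched above):

```python
from collections import Counter

def corpus_stats(components, relations):
    """Per-type counts and average component length in tokens."""
    component_counts = Counter(c.type for c in components)
    relation_counts = Counter(r.type for r in relations)
    avg_tokens = {
        t: sum(len(c.text.split()) for c in components if c.type == t) / n
        for t, n in component_counts.items()
    }
    return component_counts, relation_counts, avg_tokens
```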

The argument structure of a scientific paper forms a directed acyclic graph (DAG), where the argumentative components are the nodes and the relations are the edges. Table 4 below shows a graph analysis of these argument-structure DAGs. There are 27 standalone claims and 39 unsupported claims. The max in-degree captures the maximum number of incoming connections a node has; an average of 6 tells us that many claims come with strong supporting evidence. We also ran the PageRank algorithm to identify the most important claims and list some examples in Table 5. The results show that the majority of the highest-ranked claims are background claims, suggesting that computer graphics papers tend to put more emphasis on research gaps as the motivation for the work rather than on empirical results.
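The analysis code is not described in the post, but a sketch of this kind of graph analysis with networkx could look as follows. It assumes (as in the earlier toy example) that edges run from the supporting component to the supported claim, so that supported claims have incoming edges.

```python
import networkx as nx

def argument_graph(components, relations):
    """Build a directed graph with components as nodes and relations as edges."""
    g = nx.DiGraph()
    for c in components:
        g.add_node(c.id, type=c.type, text=c.text)
    for r in relations:
        g.add_edge(r.source, r.target, type=r.type)
    return g

def graph_stats(g):
    claims = [n for n, d in g.nodes(data=True) if d["type"].endswith("claim")]
    standalone = [n for n in claims if g.degree(n) == 0]      # no relations at all
    unsupported = [n for n in claims if g.in_degree(n) == 0]  # nothing supports them
    max_in_degree = max((g.in_degree(n) for n in g.nodes), default=0)
    return len(standalone), len(unsupported), max_in_degree

def top_claims(g, k=5):
    """Rank nodes by PageRank and return the k highest-ranked claims."""
    scores = nx.pagerank(g)
    claims = [n for n, d in g.nodes(data=True) if d["type"].endswith("claim")]
    return sorted(claims, key=lambda n: scores[n], reverse=True)[:k]
```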

Connections to other rhetorical aspects

How well do our new argumentative components connect with the existing annotations in the Dr. Inventor corpus? In Table 6 below, we report the normalised mutual information (NMI), which measures the amount of shared information between the five annotation layers. We report NMI scores for all pairs of the following layers:

  1. Argument Components (AC)

  2. Discourse Roles (DR)

  3. Subjective Aspects (SA)

  4. Summarisation Relevances (SR)

  5. Citation Contexts (CC)

There is a strong NMI score between AC and DR, which makes sense, as background claims are likely to be found in sentences with the Background discourse role. Another high NMI score is between AC and CC; this also makes sense, as citations are often referenced within background claims.
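For reference, NMI between two annotation layers can be computed with scikit-learn once each unit of text (e.g. a sentence or token) carries one label per layer. The labels below are made up purely for illustration:

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical per-sentence labels, integer-encoded:
# AC layer: 0 = own claim, 1 = background claim, 2 = data, 3 = none
# DR layer: 0 = background, 1 = approach, 2 = outcome
ac_labels = [0, 1, 2, 0, 3, 1]
dr_labels = [1, 0, 1, 1, 2, 0]

print(normalized_mutual_info_score(ac_labels, dr_labels))
```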

Conclusion and Future Work

We created the first argument-annotated corpus of scientific papers, provided key summary statistics of the corpus, and analysed its argumentative structure. Potential future work includes extending the corpus with papers from other domains and further developing models to analyse argumentation in scientific writing.
