Objective and Contribution

The authors proposed an iterative data augmentation technique to generate synthetic data from real German-language summarisation data, and subsequently built an abstractive summarisation model for German using the Transformer architecture. The paper tackles the low-resource challenge that NLP tasks face in non-English languages. Data augmentation improves the performance of the abstractive summarisation model compared to models trained without it.

Methodology

The Transformer model is implemented using OpenNMT-py. The authors used two German datasets: SwissText 2019 and Common Crawl. SwissText 2019 serves as the real data, while Common Crawl is the source of the synthetic data. The real SwissText 2019 data (100K examples) is split into train, validation, and test sets in a 90:5:5 ratio. For Common Crawl, the synthetic data is generated with the following steps:

  1. Build a vocabulary of the most frequent German words from the SwissText dataset

  2. Select sentences from the Common Crawl dataset based on the vocabulary and a threshold. For example, if a sentence has 20 words and the threshold is 10%, it is selected only if at least 2 of its 20 words appear in the vocabulary

  3. Randomly select sentences from the pool retained in step 2

  4. Treat the 100K randomly selected sentences as summaries; a model is then needed to generate the corresponding input texts. To do this, the authors trained a reverse model that takes the summary as the input and the original text as the target. Combining the 90K real training pairs with the 100K synthetic pairs yields a final training set of 190K examples (a rough sketch of this selection pipeline is given after this list)
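The post does not include implementation details for this selection pipeline, so the following is only a minimal Python sketch under stated assumptions: whitespace tokenisation, a hypothetical vocabulary size of 50,000 words, and in-memory lists of sentences.

```python
import random
from collections import Counter

def build_vocab(swisstext_sentences, vocab_size=50_000):
    """Step 1: collect the most frequent German words from the SwissText data.

    The vocabulary size and tokenisation are assumptions, not taken from the paper.
    """
    counts = Counter()
    for sentence in swisstext_sentences:
        counts.update(sentence.lower().split())  # naive whitespace tokenisation
    return {word for word, _ in counts.most_common(vocab_size)}

def select_sentences(common_crawl_sentences, vocab, threshold=0.10):
    """Step 2: keep a sentence only if at least `threshold` of its words are in-vocabulary."""
    selected = []
    for sentence in common_crawl_sentences:
        words = sentence.lower().split()
        if not words:
            continue
        in_vocab = sum(1 for word in words if word in vocab)
        if in_vocab / len(words) >= threshold:
            selected.append(sentence)
    return selected

def sample_synthetic_summaries(selected_sentences, sample_size=100_000):
    """Steps 3-4: randomly pick sentences to serve as synthetic summaries.

    A reverse-trained model (summary -> text) then generates the matching input texts.
    """
    return random.sample(selected_sentences, sample_size)
```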

Lastly, to improve the quality of the synthetic data, the authors used an iterative approach: first train the Transformer model on the real and synthetic data, then use the trained model to regenerate the synthetic data, which in turn is used to train the final Transformer model (a schematic sketch of this loop follows).
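As a rough outline of that loop, here is a schematic Python sketch. The train_reverse and train_forward callables are hypothetical placeholders standing in for OpenNMT-py training and inference runs; they are not calls from the paper or the library.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (text, summary)
Model = Callable[[str], str]  # maps an input string to a generated string

def iterative_augmentation(
    real_pairs: List[Pair],
    synthetic_summaries: List[str],
    train_reverse: Callable[[List[Pair]], Model],  # trains a summary -> text model
    train_forward: Callable[[List[Pair]], Model],  # trains a text -> summary model
) -> Model:
    """Schematic data flow of the iterative augmentation described above."""
    # A reverse model trained on the real data generates the first batch of
    # synthetic (text, summary) pairs from the selected Common Crawl sentences.
    reverse_model = train_reverse(real_pairs)
    synthetic_pairs = [(reverse_model(s), s) for s in synthetic_summaries]

    # Retrain on real + synthetic data and regenerate the synthetic texts,
    # aiming for higher-quality synthetic pairs.
    reverse_model = train_reverse(real_pairs + synthetic_pairs)
    synthetic_pairs = [(reverse_model(s), s) for s in synthetic_summaries]

    # The final summarisation model (text -> summary) is trained on the
    # combined real and regenerated synthetic data.
    return train_forward(real_pairs + synthetic_pairs)
```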

Experimental Setup and Results

There are three experiment settings as follows:

  1. S1: Train the Transformer model using only real data (90K). This is the baseline

  2. S2: Train the Transformer model using real and synthetic data (190K)

  3. S3: Train the Transformer model using real and regenerated synthetic data

Results

Model S2 performed best, despite the effort to improve the quality of the synthetic data in S3. The ROUGE scores peaked at an early training iteration, showing that good results can be reached with a shorter training process. Compared to S1, model S2 shows more variety in word and phrase generation as well as in summary length: the average summary length is 41.42 for S2 versus 39.81 for S1.
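The post does not say which ROUGE implementation was used; purely as an illustration, the scores could be computed with the rouge-score package (pip install rouge-score). The example sentences below are made up.

```python
from rouge_score import rouge_scorer

# Stemming in rouge-score is English-specific, so it is disabled here for German.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "Der Rhein ist ein Fluss in Europa."       # gold summary (made-up example)
prediction = "Der Rhein ist ein europäischer Fluss."   # model output (made-up example)

scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```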

Conclusion and Future Work

Potential future work could involve further investigation into synthetic summarisation data and the use of transfer learning for text summarisation in non-English languages with low-resource data.
