Objective and Contribution

We present MOOCCube, a large multi-dimensional data repository covering over 700 MOOC courses, 100K concepts, and 8 million student behaviours, together with external resources. We performed an initial prerequisite relation discovery task to showcase the potential of MOOCCube, and we hope that this data repository will open up NLP and AI research for education applications such as course concept extraction.


How is MOOCCube different from existing education datasets?
  1. Large multi-dimensional dataset. MOOCCube covers 700 courses, 38K videos, 200K students, and 100K concepts with 300K relation instances.

  2. High coverage. MOOCCube covers all the attributes and relationships because the data are obtained from real MOOC websites. As shown below, a data cell of MOOCCube is defined over courses, concepts, and students, and represents a student s learning concept k in course c. This allows MOOCCube to provide different combinations of these data cells.

  3. MOOCCube can be used to build datasets for different tasks, such as dropout prediction and concept extraction, which previously required two separate datasets.
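The data cell idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Cell` type, the field names, the toy records, and the two slicing helpers are all hypothetical and do not reflect the actual MOOCCube file format.

```python
from collections import defaultdict
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical data cell: student s learns concept k in course c."""
    student: str
    concept: str
    course: str


# Toy records, invented purely for illustration.
cells = [
    Cell("s1", "gradient descent", "c_ml"),
    Cell("s1", "backpropagation", "c_ml"),
    Cell("s2", "gradient descent", "c_ml"),
    Cell("s2", "binary tree", "c_ds"),
]


def concepts_by_course(cells):
    """Course-concept slice, e.g. a starting point for concept extraction."""
    out = defaultdict(set)
    for cell in cells:
        out[cell.course].add(cell.concept)
    return dict(out)


def courses_by_student(cells):
    """Student-course slice, e.g. a starting point for dropout prediction."""
    out = defaultdict(set)
    for cell in cells:
        out[cell.student].add(cell.course)
    return dict(out)
```

Both helpers read from the same list of cells, which is the point of item 3: one repository, sliced differently per task.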

Dataset Collection

MOOCCube is broken down into three main dimensions:

  1. Courses

  2. Concepts

  3. Students

Course Extraction

A course is a series of pre-recorded videos. For each course, we extracted the synopsis, video list, teacher, and organisation. We also extracted the video order and subtitles, and recorded detailed descriptions of the teacher and organisation using Wikidata.

Concept and Concept Graph

In this dimension, we aim to extract the knowledge concepts taught in course videos. For each video, we extract the 10 most representative course concepts from the subtitles. For each concept, we record its description using Wikidata and search for the top 10 related papers using AMiner. Lastly, we build a novel concept taxonomy with prerequisite chains as a concept graph to capture the semantic relationships between concepts. A prerequisite chain is defined as follows: if concept A helps in understanding concept B, then concept A has a prerequisite relation to concept B. To build these prerequisite chains, we take two steps:

  1. Reduce the number of candidate concept pairs using taxonomy information and video dependency

  2. Manually annotate the remaining pairs, then use the annotated labels to train different models that build a larger, distantly supervised prerequisite dataset
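Step 1 can be illustrated with a small sketch. The filter below is an assumption made for illustration, not the paper's exact rule: it keeps an ordered pair (a, b) only when both concepts fall in the same taxonomy domain and a first appears in an earlier video than b. The function name, the two input dictionaries, and the toy data are hypothetical.

```python
from itertools import permutations


def candidate_pairs(domain_of, first_video_index):
    """Return ordered (a, b) pairs worth annotating as 'a is a prerequisite of b'.

    Assumed filter (not the paper's exact rule): keep a pair only if both
    concepts share a taxonomy domain and a first appears in an earlier video.
    """
    concepts = sorted(domain_of)
    return [
        (a, b)
        for a, b in permutations(concepts, 2)
        if domain_of[a] == domain_of[b]
        and first_video_index[a] < first_video_index[b]
    ]


# Toy inputs, invented for illustration.
domain_of = {"derivative": "math", "gradient": "math", "binary tree": "cs"}
first_video_index = {"derivative": 0, "gradient": 2, "binary tree": 1}

pairs = candidate_pairs(domain_of, first_video_index)
```

With three concepts there are six ordered pairs; under this assumed filter only ("derivative", "gradient") survives, which shows how pruning shrinks the manual annotation workload in step 2.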

Student Behaviour

This dimension is intended to support research in course recommendation, video navigation, dropout prediction, and the relationships between courses and concepts. Here, we preserve the enrolment records and video watch logs of almost 200K users from 2017 to 2019. The video watch logs record student behaviour while watching a video, such as common video points, clicks on a certain sentence, etc. We anonymised the users with UserIDs.

Data Analysis

The figure below compares our MOOCCube dataset with other education datasets. MOOCCube has the largest data size, exceeding previous education datasets by a wide margin in different dimensions, especially the concept graph dimension. In addition, MOOCCube covers all the different types of data in the MOOC environment, in contrast to previous education datasets, which cover either student behaviour or course content.

The figures below showcase our concept distribution and the course distribution of enrolled users. Overall, we categorised concepts into 24 domains. Our dataset has more concepts in engineering courses than in natural sciences. Figure 3 shows that 451 courses have more than 100 enrolled users and that more than 70% of users have watched more than 10 videos.


We performed prerequisite relation discovery as an example application of MOOCCube. This task aims to answer the question “what should someone learn first?”. We reproduced different methods on the MOOCCube dataset, and the results are shown below. PREREQ achieved the best F1-score, and we believe that the high coverage of the MOOCCube data helps to discover prerequisite relationships. In addition, two of our baselines, PCNN and PRNN, produced competitive results, showcasing the effectiveness of our dataset.
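For readers unfamiliar with the metric, the sketch below shows how an F1-score could be computed for this task when predictions and gold labels are both treated as sets of ordered (prerequisite, target) concept pairs. That framing, the function name, and the toy pairs are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
def f1_score(predicted, gold):
    """F1 over sets of ordered (prerequisite, target) concept pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy pairs, invented for illustration.
gold = {("a", "b"), ("b", "c"), ("a", "c")}
predicted = {("a", "b"), ("b", "c"), ("c", "a")}

score = f1_score(predicted, gold)
```

Here two of the three predictions are correct and two of the three gold pairs are recovered, so precision and recall are both 2/3 and the F1-score is also 2/3. Note that ("c", "a") counts as wrong because the relation is directional.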

Conclusion and Future Work

Potential future work includes a) utilising more data types from MOOCCube to facilitate existing topics, b) employing advanced models for existing tasks, and c) discovering more innovative NLP applications within the online education domain.


