This document describes research on using deep learning to predict student performance in massive open online courses (MOOCs). It introduces GritNet, a model that takes raw student activity data as input and predicts outcomes like course graduation without feature engineering. GritNet outperforms baselines by more than 5% in predicting graduation. The document also describes how GritNet can be adapted in an unsupervised way to new courses using pseudo-labels, improving predictions in the first few weeks. Overall, GritNet is presented as the state-of-the-art for student prediction and can be transferred across courses without labels.
2. The Return of the MOOC

Background
● Education is no longer a one-time event but a lifelong experience
● Working lives are now so lengthy and fast-changing that people must be able to acquire new skills throughout their careers
● MOOCs are broadening access to cutting-edge vocational subjects
Hak Kim |
3. Challenges in the World of MOOCs

● Significant increase in student numbers
● Impractical for human instructors to assess each student’s level of engagement
● Increasingly fast development and update cycles for course content
● Diverse demographics of students in each online classroom
4. Student Performance Prediction

● Forecasting students’ future performance as they interact with online coursework is a challenging problem
● Reliable early-stage prediction of a student’s future performance is critical to facilitate timely (educational) interventions during a course

Only a few prior works have explored the problem from a deep learning perspective!
6. Student Event Sequence

Graduate (user 1122)
● First views four lecture videos, then gets a quiz correct
● In the subsequent 13 events, watches a series of videos, answers quizzes, submits the first project three times, and finally gets it passed

Drop-out (user 2214)
● Views four lectures sparsely and fails the first project three times in a row

From this raw student activity data, GritNet can predict whether or not a student will graduate (or any other student outcome)
7. GritNet

Input
● Feed students’ time-stamped raw event sequences (without engineered features)
● Encode them by one-hot encoding of the possible event tuples {action, timestamp} - to preserve ordering information and to capture each student’s learning speed

Output
● Probability of a student outcome, e.g., graduation

Model
● The GMP (global max pooling) layer captures the most relevant part of the input event sequence and deals with the imbalanced nature of the data!
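As a sketch of how such an encoding might look (the names and the week-level timestamp bucketing are illustrative assumptions, not the paper's code), each distinct {action, timestamp} tuple gets its own one-hot index, so both event order and pacing survive:

```python
def build_vocab(event_sequences):
    """Map every distinct (action, week) tuple to a one-hot index."""
    vocab = {}
    for seq in event_sequences:
        for tup in seq:
            if tup not in vocab:
                vocab[tup] = len(vocab)
    return vocab

def one_hot_encode(seq, vocab):
    """Encode one student's event sequence as a list of one-hot vectors."""
    vectors = []
    for tup in seq:
        v = [0] * len(vocab)
        v[vocab[tup]] = 1
        vectors.append(v)
    return vectors

# Toy example: two students; events are (action, week-index) tuples.
student_a = [("watch_video_1", 0), ("quiz_1_correct", 0), ("submit_project", 1)]
student_b = [("watch_video_1", 1), ("submit_project", 2)]
vocab = build_vocab([student_a, student_b])
encoded = one_hot_encode(student_a, vocab)
```

Note that the same action in a different week gets a different index - that is what lets the encoding capture learning speed, not just the sequence of actions.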
8. GritNet Training

The training objective is the negative log likelihood of the observed event sequence of student activities
● Binary cross-entropy is minimized using stochastic gradient descent on mini-batches (size 32)

Model optimization
● BLSTM layers of 128 cell dimensions per direction
● Embedding layer dimension grid-searched
● Dropout applied to the BLSTM output
● Note: the baseline model (logistic-regression based) was also optimized extensively

A separate model was trained for each week, based on students’ week-by-week event records
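The binary cross-entropy objective can be written out directly; this is a minimal pure-Python illustration of the mini-batch loss (a real implementation would use a deep learning framework's built-in loss):

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean BCE over a mini-batch: -[y*log(p) + (1-y)*log(1-p)]."""
    assert len(probs) == len(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Confident correct predictions yield low loss; confident wrong ones, high loss.
low = binary_cross_entropy([0.9, 0.1], [1, 0])
high = binary_cross_entropy([0.1, 0.9], [1, 0])
```

In training this quantity is averaged over mini-batches of 32 students and minimized with stochastic gradient descent, as stated above.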
9. Prediction Performance

On Udacity Data
● ND-A program: 777 of 1,853 students graduated (41.9%)
● ND-B program: 1,005 of 8,301 students graduated (12.1%)

Results
● Compared student graduation prediction accuracy in terms of mean area under the ROC curve (AUC) over 5-fold cross-validation
● GritNet consistently outperforms the baseline, with superior performance by more than 5% absolute on both the ND-A and ND-B datasets
● On the ND-B dataset, the baseline model requires a wait of two months to reach the same performance that GritNet shows within three weeks

I.e., GritNet’s improvements are most pronounced in the first few weeks, when accurate predictions are most challenging!

[Chart annotations: 5.3% abs (7.7% rel); 5% abs]
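AUC, the evaluation metric used here, equals the probability that a randomly chosen graduate is scored above a randomly chosen drop-out; a small illustrative implementation via the Mann-Whitney U statistic (not the authors' evaluation code):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example is
    scored above a random negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Perfect ranking -> AUC 1.0; random ranking hovers around 0.5.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Because AUC is a ranking measure, it remains meaningful on heavily imbalanced cohorts such as ND-B's 12.1% graduation rate.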
10. Intermission
GritNet is the current state of the art for the student prediction problem
● P1: Does not need any feature engineering (learns from raw input)
● P2: Can operate on any student event data associated with a timestamp
(even when highly imbalanced)
11. Remaining Challenges in the World of MOOCs

● Significant increase in student numbers
● Impractical for human instructors to assess each student’s level of engagement
● Increasingly fast development and update cycles for course content
● Diverse demographics of students in each online classroom
12. What’s Next

● So far, GritNet has been trained and tested on data from the same course
● Goal: real-time prediction during an ongoing course (before the course finishes)
14. Intuition

Observation
● The GMP layer output captures a generalizable sequential representation of the input event sequence
● The FC layer learns course-specific features

GritNet can thus be viewed as a sequence-level feature (or embedding) extractor plus a classifier
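Global max pooling itself is simple; a toy sketch (the hidden size and values are made up) of how it turns a variable-length sequence of per-timestep BLSTM outputs into one fixed-size embedding:

```python
def global_max_pool(timestep_outputs):
    """Elementwise max over time: one fixed-size vector per student,
    regardless of how many events the student generated."""
    return [max(col) for col in zip(*timestep_outputs)]

# Toy BLSTM outputs for a 3-event sequence, hidden size 4.
h = [[0.1, -0.3, 0.2, 0.0],
     [0.7,  0.1, -0.5, 0.2],
     [0.2,  0.4,  0.1, -0.1]]
pooled = global_max_pool(h)  # -> [0.7, 0.4, 0.2, 0.2]
```

Only the strongest activation anywhere in the sequence survives per dimension, which is why the pooled vector can serve as a course-agnostic sequence embedding.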
15. Domain Adaptation

Unsupervised DA framework
● Trained on a previous (source) course but deployed on another (target) course
● E.g., v1 → v2, or an existing → a newly-launched ND program

Ordinal Input Encoding
● Group (raw) content IDs into subcategories
● Normalize them using the natural ordering implicit in the content paths within each category
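A minimal sketch of ordinal input encoding under one possible interpretation (the path format and the category-by-prefix rule are assumptions, not the paper's exact scheme):

```python
def ordinal_encode(content_paths):
    """Group content IDs by category, then normalize each item's position
    using the natural ordering of its path within the category."""
    by_category = {}
    for path in content_paths:
        category = path.split("/")[0]
        by_category.setdefault(category, []).append(path)
    encoding = {}
    for category, paths in by_category.items():
        for i, path in enumerate(sorted(paths)):
            # Normalized ordinal in (0, 1]: relative position in the category.
            encoding[path] = (i + 1) / len(paths)
    return encoding

paths = ["lesson1/video1", "lesson1/video2", "lesson1/quiz3", "project/p1"]
enc = ordinal_encode(paths)
```

The point of the normalization is that "second video of lesson 1" maps to a similar value in the source and target courses even when the raw content IDs differ.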
17. Domain Adaptation (cont’d)

Pseudo-label generation
● No outcome labels are available from the target course
● Use a trained GritNet to assign hard one-hot pseudo-labels (Step 4)
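A sketch of the pseudo-label step, assuming a 0.5 cutoff on the source-trained model's output probabilities (the threshold value is an assumption; the slide only specifies hard one-hot labels):

```python
def make_pseudo_labels(probabilities, threshold=0.5):
    """Turn a source-trained model's graduation probabilities on the
    (unlabeled) target course into hard one-hot pseudo-labels."""
    labels = []
    for p in probabilities:
        # [drop-out, graduate] one-hot encoding.
        labels.append([0, 1] if p >= threshold else [1, 0])
    return labels

target_probs = [0.91, 0.13, 0.55]
pseudo = make_pseudo_labels(target_probs)  # -> [[0, 1], [1, 0], [0, 1]]
```

These hard labels then stand in for real outcomes when fine-tuning on the target course.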
18. Domain Adaptation (cont’d)

Transfer Learning
● Continue training the last FC layer on the target course while keeping the GritNet core fixed (Steps 6 and 7)
● The last FC layer limits the number of parameters to learn

A very practical solution: faster adaptation and applicability even to smaller target courses!
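In a framework like PyTorch, freezing the core and leaving only the final FC layer trainable might look like this (the layer sizes and module layout are illustrative, not the paper's implementation):

```python
import torch.nn as nn

# Toy stand-in for the GritNet core (embedding + BLSTM) and the final
# FC classifier; dimensions here are made up for illustration.
core = nn.ModuleDict({
    "embedding": nn.Embedding(100, 16),
    "blstm": nn.LSTM(16, 32, bidirectional=True),
})
fc = nn.Linear(64, 2)  # 2 * 32 BLSTM features -> 2 outcome classes

# Freeze the core: gradients flow only into the last FC layer,
# which is all that gets updated on the target course.
for param in core.parameters():
    param.requires_grad = False
```

An optimizer would then be constructed over `fc.parameters()` only, which is what keeps adaptation fast and feasible on small target courses.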
19. Experimental Setup

On Udacity Data
● Vanilla baseline: LR trained on the source course, evaluated on the target course
● GritNet baseline: GritNet trained on the source course, evaluated on the target course
○ BLSTM layers of 256 cell dimensions per direction and an embedding layer of 512 dimensions
● Domain adaptation: the proposed pseudo-label approach
● Oracle: use oracle target labels instead of pseudo-labels
20. Generalization Performance

[Chart annotations: 70.60% ARR; 58.06% ARR; up to 84.88% ARR at week 3; up to 75.07% ARR at week 3]

Results
● AUC Recovery Rate (ARR): the AUC improvement that unsupervised domain adaptation provides over the baseline, divided by the absolute improvement that oracle training produces
○ Clear wins of the GritNet baseline vs. the vanilla baseline
○ Domain adaptation enhances real-time predictions, especially in the first few weeks!
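Under this definition, ARR can be paraphrased as a simple ratio of AUC gaps (the function and the numbers below are illustrative, not results from the slides):

```python
def auc_recovery_rate(auc_baseline, auc_adapted, auc_oracle):
    """Fraction of the baseline-to-oracle AUC gap that unsupervised
    domain adaptation recovers."""
    return (auc_adapted - auc_baseline) / (auc_oracle - auc_baseline)

# Hypothetical example: if the baseline scores 0.80 AUC, oracle training
# 0.90, and pseudo-label adaptation 0.87, adaptation recovers 70% of the gap.
arr = auc_recovery_rate(0.80, 0.87, 0.90)
```

An ARR of 100% would mean pseudo-labels are as good as having the true target labels.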
21. Summary

GritNet is the current state of the art for the student prediction problem
● P1: Does not need any feature engineering (learns from raw input)
● P2: Can operate on any student event data associated with a timestamp (even when highly imbalanced)
● P3: Can be transferred to new courses without any (student outcome) labels!

Full papers are available on arXiv (GritNet, GritNet2)
22. Looking Ahead

Is a student likely to benefit from (sequential) interventions?

[Diagram: student X, intervention T, learning outcomes Y]