Student Performance Prediction
Status | IN PROGRESS
---|---
Stakeholders |
Outcome |
Due date |
Owner |
Related JIRA task |
Background
The student performance prediction module aims to provide a generic solution for predicting student performance on SlideWiki Questions and Exams (and could potentially be applied to data from external LMSs as well).
As part of a preliminary analysis (in Matlab), a number of ML techniques (k-NN, SVM, ANN, decision trees, naïve Bayes, matrix factorization, etc.) were evaluated and compared using the Open University Learning Analytics Dataset (OULAD). All techniques were evaluated for both classification and regression formulations of the prediction task, based on relevant quantitative parameters (such as recall, precision, average F1, average RMSE, computational requirements, prediction time, etc.). Furthermore, different combinations of input data (demographic (D), engagement (E) and performance (P) data) were investigated in order to discover the optimal input data set for predictions. A digested overview of the comprehensive analysis is shown below (confidential - to be published):
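For reference, the evaluation metrics mentioned above (precision, recall, average F1 for classification; RMSE for regression) can be computed as follows. This is a minimal stdlib Python sketch on made-up toy values, not the Matlab evaluation code:

```python
import math

def f1_per_class(y_true, y_pred, cls):
    """Precision/recall/F1 for one class (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def average_f1(y_true, y_pred):
    """Macro-averaged F1 over all classes present in the ground truth."""
    classes = set(y_true)
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

def rmse(y_true, y_pred):
    """Root mean squared error for the regression task."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy classification (pass/fail) and regression (exam score) examples
print(average_f1(["pass", "fail", "pass", "pass"], ["pass", "fail", "fail", "pass"]))
print(rmse([60, 75, 50], [58, 80, 49]))
```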
Based on the preliminary analysis and the problem specification, student performance prediction will be performed in two ways:
- Student performance based approach (Regression), and
- Student engagement based approach (Classification).
1. STUDENT PERFORMANCE BASED APPROACH
As part of the performance based approach, predictions will be made from past student performance using the matrix factorization technique. This is a regression task (prediction of a continuous variable), as the predicted value represents an exam score (e.g. 1-100).
Input data will be obtained from the LRS and provided in the form of a student/exam matrix. Creation of the input matrix follows these steps:
- Performance prediction is to be performed for student A on exam X
- Get (from the LRS) all exams (exam IDs with scores) that student A has taken in the past (let's say exams Y, W and Z)
- Get all other students (and their respective exam scores) who have already taken exam X in the past (let's say students B, C and D)
- Get the exam scores of students B, C and D on exams Y, W and Z (if they took them; leave the cell empty otherwise)
Finally, we will have a sparse student/exam matrix (partially filled, as different students may have taken different exams), with the last column and the last row completely filled, except for the last cell (the value to be predicted):
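The matrix construction steps above can be sketched as follows. This is a minimal Python illustration, assuming LRS records are available as (student_id, exam_id, score) tuples; the actual LRS statement format will differ:

```python
def build_matrix(records, target_student, target_exam):
    """Arrange scores so the target student is the last row and the target
    exam is the last column; None marks a cell a student did not take."""
    # Exams the target student took in the past, with the target exam last
    exams = sorted({e for s, e, _ in records if s == target_student} - {target_exam})
    exams.append(target_exam)
    # Students who already took the target exam, with the target student last
    students = sorted({s for s, e, _ in records
                       if e == target_exam and s != target_student})
    students.append(target_student)
    scores = {(s, e): v for s, e, v in records}
    return [[scores.get((s, e)) for e in exams] for s in students]

# Toy data following the example: A took Y, W, Z; B, C, D already took X.
records = [
    ("A", "Y", 70), ("A", "W", 55), ("A", "Z", 80),
    ("B", "X", 60), ("B", "Y", 65),
    ("C", "X", 75), ("C", "Z", 85),
    ("D", "X", 50), ("D", "W", 45),
]
matrix = build_matrix(records, "A", "X")
# Last row (student A) and last column (exam X) are completely filled,
# except the last cell: A's score on X, the value to be predicted.
```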
By running matrix factorization, the input matrix is decomposed and the missing cell values are estimated; the last element of the last row then represents the predicted value (i.e. the exam score).
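As an illustration of this step, here is a plain SGD matrix factorization run on the example matrix (students B, C, D and A; exams W, Y, Z and X, with made-up scores). The real module uses Spark MLlib; this sketch only shows the idea:

```python
import random

def factorize(matrix, k=2, steps=2000, lr=0.01, reg=0.02):
    """Decompose a partially observed matrix into student and exam factor
    vectors (SGD on observed cells only), then rebuild it so every cell,
    including the missing ones, gets an estimate."""
    rng = random.Random(0)
    n, m = len(matrix), len(matrix[0])
    cells = [(i, j, v) for i, row in enumerate(matrix)
             for j, v in enumerate(row) if v is not None]
    scale = max(v for _, _, v in cells)  # train on values scaled to [0, 1]
    P = [[rng.random() for _ in range(k)] for _ in range(n)]  # student factors
    Q = [[rng.random() for _ in range(k)] for _ in range(m)]  # exam factors
    for _ in range(steps):
        for i, j, v in cells:
            err = v / scale - sum(P[i][f] * Q[j][f] for f in range(k))
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)
    return [[scale * sum(P[i][f] * Q[j][f] for f in range(k))
             for j in range(m)] for i in range(n)]

# Student/exam matrix from the example: rows B, C, D, A; columns W, Y, Z, X.
matrix = [
    [None, 65, None, 60],   # student B
    [None, None, 85, 75],   # student C
    [45, None, None, 50],   # student D
    [55, 70, 80, None],     # student A
]
full = factorize(matrix)
predicted_score = full[-1][-1]  # A's estimated score on exam X
```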
2. STUDENT ENGAGEMENT BASED APPROACH
As part of the engagement based approach, predictions will be made from past student engagement (potentially combined with student demographic data) using the decision tree technique. This is a classification task (prediction of a discrete variable), as the predicted value represents the exam result as one of two categories (pass or fail).
Input data will be obtained from the LRS and provided in the form of a student/engagement matrix. Creation of the input matrix follows these steps:
- Performance prediction is to be made for student A on exam X
- Get (from the LRS) all exams (exam IDs with the respective engagement data) that student A has taken in the past (let's say exams Y, W and Z)
- Get all other students (and their respective exam results - pass/fail) who have already taken exam X in the past (let's say students B, C and D)
- Get the engagement data of students B, C and D for exams Y, W and Z, i.e. for the respective decks (value 0 if they had no engagement)
Finally, we will have a completely filled student/engagement matrix, except for the missing value in the last cell (to be predicted):
By applying decision trees to the input matrix, the missing value is estimated, representing a prediction of whether the student will pass or fail the exam.
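The classification step can be illustrated with a minimal CART-style decision tree (Gini impurity, binary splits). The engagement features and labels below are made up, and the actual module uses Spark MLlib's decision tree implementation; this is only a self-contained sketch of the technique:

```python
def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Find (impurity, feature, threshold) minimizing weighted Gini."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=3):
    """Recursively split; a leaf is the majority class of its labels."""
    if depth == 0 or len(set(labels)) == 1:
        return max(set(labels), key=labels.count)
    split = best_split(rows, labels)
    if split is None:
        return max(set(labels), key=labels.count)
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth - 1),
            build_tree([r for r, _ in right], [l for _, l in right], depth - 1))

def predict(tree, row):
    """Walk the tree until reaching a leaf (a class label)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Toy engagement vectors: [deck visits, comments, exam attempts]
rows = [[12, 3, 1], [2, 0, 3], [9, 1, 1], [1, 0, 2], [15, 4, 1]]
labels = ["pass", "fail", "pass", "fail", "pass"]
tree = build_tree(rows, labels)
print(predict(tree, [10, 2, 1]))  # engagement similar to the "pass" students
```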
Both approaches have already been implemented using Spark MLlib and validated on the OULAD dataset. Bearing in mind that prediction on the OULAD database, which covers roughly 1,000 students and 7 learning modules, took on average 0.7 s/student using matrix factorization and 1.9 s/student using decision trees (on a DELL laptop with an Intel Core i7-4500U CPU @ 1.80 GHz and 16 GB RAM), performance predictions should be performed offline (in the background). Running these algorithms on a larger database (which SlideWiki will eventually have) will take even more time. Adequate data/time constraints and reduction of the problem dimensionality will therefore be taken into account in the final phase of development.
Student-related features
The following is a list of student-related features potentially relevant for predicting student performance:
Category | # | Feature
---|---|---
STUDENT PERFORMANCE | 1. | Past exam results (grades)
STUDENT ENGAGEMENT | 1. | List of visited decks (number and duration, slide clicks)
 | 2. | List of commented decks (number of comments)
 | 3. | Number of exam attempts
 | 4. | List of shared decks (number of shares)
 | 5. | List of printed/downloaded decks
 | 6. | List of liked decks
 | 7. | List of rated decks (with given ratings)
STUDENT DEMOGRAPHIC & CONTEXTUAL DATA | 1. | User location (country)
 | 2. | Date and time of record
 | 3. | User age and skills, user groups, deck contributors…
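One possible shape for the per-student feature record implied by this list; all field names and types below are assumptions for illustration, not an agreed schema:

```python
from dataclasses import dataclass, field

@dataclass
class StudentFeatures:
    # Performance
    past_exam_results: dict = field(default_factory=dict)  # exam_id -> score
    # Engagement
    visited_decks: dict = field(default_factory=dict)      # deck_id -> (visits, duration_s, slide_clicks)
    commented_decks: dict = field(default_factory=dict)    # deck_id -> comment count
    exam_attempts: int = 0
    shared_decks: dict = field(default_factory=dict)       # deck_id -> share count
    printed_or_downloaded_decks: list = field(default_factory=list)
    liked_decks: list = field(default_factory=list)
    rated_decks: dict = field(default_factory=dict)        # deck_id -> rating
    # Demographic & contextual
    country: str = ""
    recorded_at: str = ""                                  # ISO 8601 timestamp
    age: int = 0

features = StudentFeatures(country="DE", exam_attempts=2)
```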
DISCUSSION POINTS & IDEAS:
1. PREDICTION MODULE DEPLOYMENT – Integration of prediction module with other components in the loop
- In order to fetch the input data (such as past exam scores from the LRS, user engagement data from the activity service (or the LRS?), and user profile data), will there be an intermediary component for data acquisition (at least for the LRS)?
- What is the status of the LRS, of the intermediary component for data acquisition (if one is planned), and of any other component you might see as necessary?
- Deployment of the prediction module as a microservice, using Docker? Assistance would be most appreciated in this regard :)
2. INPUT DATA FORMAT – How and where the input data will be stored
- Will the results of SlideWiki Questions and Exams be stored in the LRS? Will the range of exam scores be unified (e.g. 1-100) or not? Will there be a single threshold for passing an exam (such as 40%)? Is it planned that each deck has a dedicated exam, or only certain ones?
- Who will be responsible for the Questions and Exams module from now on (since @Vinay is leaving the project)?
- What kind of data will the LRS keep (only exam results, or user activity data as well)?
3. UI VISUALIZATION – In which form the prediction results should be visualized
- Introduce a prediction tab (as part of (1) the user profile or (2) the content modules panel; option (1) is perhaps preferable) where a prediction could be initiated (for the active deck/exam) and where the results, once available, could be visualized: in the form of tips ("Compared to the activity of other SlideWiki users, we advise you to…") or dedicated graphs.
- This could be implemented from the perspective of a student (who can analyze predictions for him/herself), as well as from the perspective of a professor (who can investigate predictions for a specified student and exam, e.g. selected from a dropdown list).
Please comment and share ideas. Many thanks!