This study introduces a machine learning framework to predict and analyze student performance by examining demographic, socioeconomic, and behavioral factors, with a focus on underdeveloped countries such as Nepal. Our framework applies multiple classification algorithms, including Decision Tree, Random Forest, and SVM, to predict academic performance from factors such as parental support, study time, and extracurricular activities. On a dataset of 2,392 students, the Random Forest model achieved the highest accuracy (93.95%) and identified the key factors influencing academic success. Practical applications of this research support educational institutions, EdTech platforms, and policymakers in adopting data-driven approaches to improve academic outcomes. Experimental details, including hyperparameters, computational settings, and evaluation methods, are provided for reproducibility.
---
### **Introduction**
#### Background and Motivation
Underdeveloped countries face unique educational challenges due to socio-economic and infrastructural constraints. In regions like Nepal, understanding factors that influence student performance can lead to targeted interventions. Machine learning offers a powerful approach to assess students’ academic performance, allowing educators and policymakers to make data-driven decisions. This research aims to bridge this gap by developing a predictive model for academic success and identifying key contributing factors.
#### Research Objectives
The main objectives of this study are to:
1. Build a machine learning model to predict academic performance using relevant features.
2. Examine the effects of parental support, study habits, and extracurricular activities on student performance.
3. Provide actionable insights for educational institutions, EdTech platforms, and policymakers to improve academic outcomes in underdeveloped countries.
#### Research Scope
Focusing on secondary school students in Nepal, this study examines diverse metrics, including weekly study time, parental support, and extracurricular involvement, with the aim of deriving insights applicable to broader socio-economic settings.
---
### **Literature Review and Related Works**
#### Regional and Global Insights on Student Performance Prediction
Studies on student performance prediction in high-income countries have revealed a variety of factors influencing academic success, from socio-economic status to parental education and extracurricular involvement. However, research in underdeveloped countries remains sparse, with few studies addressing how specific cultural and socio-economic contexts shape learning. The limited availability of data-driven approaches in these regions underscores the need for this research, which aims to surface the key factors hindering student performance in such contexts.
#### Machine Learning in Educational Assessment
Machine learning techniques have shown substantial promise in educational data applications, offering models that can analyze and interpret complex student performance indicators. Decision Trees, Random Forests, and Support Vector Machines (SVM) are among the most effective models in handling non-linear relationships, which are common in educational data. Random Forests, in particular, excel in predictive accuracy and feature importance, making them suitable for identifying key performance drivers and offering actionable insights.
---
### **Dataset Description**
#### Dataset Summary
The dataset includes 2,392 records of secondary school students, covering demographic, academic, and behavioral metrics as well as parental and extracurricular involvement indicators. The dataset was split into 80% training and 20% testing sets, ensuring adequate data for model evaluation and hyperparameter tuning. All relevant statistics are presented below.
**Demographic Details**
- Age: 15–18 years.
- Gender: 0 = Male, 1 = Female.
- Ethnicity: 0 = Caucasian, 1 = African American, 2 = Asian, 3 = Other.
- ParentalEducation: 0 = None, 1 = High School, 2 = Some College, 3 = Bachelor's, 4 = Higher.

**Academic and Behavioral Metrics**
- StudyTimeWeekly: weekly study time in hours (0–20).
- Absences: number of absences during the school year (0–30).
- Tutoring: 0 = No, 1 = Yes.

**Parental and Extracurricular Involvement**
- ParentalSupport: coded from 0 (None) to 4 (Very High).
- Extracurricular, Sports, Music, Volunteering: 0 = No, 1 = Yes.

**Target Variable: Grade Class**
- GradeClass: coded from 0 (A) to 4 (F), derived from GPA on a 2.0–4.0 scale.
**Table 1: Dataset Summary**
| Feature | Description | Data Type | Range or Categories |
|--------------------|-------------------------------------------|-------------|---------------------------------|
| Age | Age of student | Continuous | 15–18 |
| Gender | Student gender (0=Male, 1=Female) | Categorical | 0, 1 |
| Ethnicity | Student ethnicity | Categorical | 0: Caucasian, 1: African American, 2: Asian, 3: Other |
| Parental Education | Education level of parents | Ordinal | 0: None to 4: Higher |
| StudyTimeWeekly | Weekly study time in hours | Continuous | 0–20 |
| Absences | Number of absences during the school year | Continuous | 0–30 |
| GPA | Grade Point Average (2.0–4.0 scale) | Continuous | 2.0–4.0 |
| GradeClass | Target variable, classified based on GPA | Ordinal | 0: A to 4: F |
**Preprocessing Steps:**
1. **Exclusions**: The StudentID identifier was dropped, and Age, Gender, Ethnicity, and Volunteering were excluded due to weak correlation with the target variable.
2. **Data Standardization**: All continuous variables were standardized.
3. **Encoding**: Categorical variables were one-hot encoded where necessary.
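A minimal sketch of these preprocessing steps with scikit-learn, using a toy frame whose column names follow Table 1 (the real dataset loading is omitted, and the four sample rows are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame standing in for the real dataset; column names follow Table 1.
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4],
    "StudyTimeWeekly": [5.0, 12.5, 8.0, 19.0],
    "Absences": [2, 10, 0, 25],
    "ParentalSupport": [3, 1, 4, 0],
    "ParentalEducation": [2, 1, 3, 0],
    "GradeClass": [1, 3, 0, 4],
})

# Step 1: drop the identifier and the target from the feature matrix.
X = df.drop(columns=["StudentID", "GradeClass"])
y = df["GradeClass"]

# Steps 2-3: standardize continuous features; one-hot encode categoricals.
pre = ColumnTransformer(
    [("num", StandardScaler(), ["StudyTimeWeekly", "Absences"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"),
      ["ParentalSupport", "ParentalEducation"])],
    sparse_threshold=0.0,  # force a dense array for inspection
)
X_pre = pre.fit_transform(X)
```

Wrapping the transformer in a `Pipeline` with the classifier keeps the same scaling applied consistently to training and test splits.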
#### Data Collection and Access
The dataset is accessible for download [here (insert link)]. Data was collected following strict quality control measures, with annotators ensuring accuracy across all demographic and behavioral attributes.
---
### **Exploratory Data Analysis**
**Figure 1: Grade Distribution Pie Chart**
Displays the distribution of grades across students, providing a clear understanding of the target class spread.
**Demographic Analysis**
- **Figure 2a**: Gender Distribution
- **Figure 2b**: Age Distribution
- **Figure 2c**: Ethnicity Distribution
**Parental Involvement Analysis**
**Figure 3**: Influence of Parental Support on Grade Class.
Highlights how different levels of parental support correlate with grade distribution.
**Study Habits Analysis**
- **Figure 4a**: Weekly Study Time by Grade Class
- **Figure 4b**: Absences by Grade Class
**Extracurricular Activities Analysis**
- **Figure 5a**: Impact of Extracurricular Participation on GPA
- **Figure 5b-5d**: Impacts of Sports, Music, and Volunteering on GPA
**Correlation Analysis**
**Figure 6: Correlation Heatmap**
This heatmap highlights relationships among variables, showing that study time, absences, and parental support are more correlated with GPA and Grade Class, while demographic factors (e.g., Age, Gender) have minimal impact.
Key insights from the correlation heatmap:
- **Academic performance factors**: StudyTimeWeekly, Absences, and ParentalSupport are the main factors affecting students' academic performance.
- **Extracurricular activities**: Music, Sports, and Extracurricular participation show very weak correlations with academic performance (GPA and GradeClass), and Volunteering shows almost no correlation (near zero).
- **Demographic factors**: Age, Gender, and Ethnicity show very weak or negligible correlations with academic performance, suggesting that demographic factors do not strongly influence academic outcomes in this dataset.
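The correlation analysis above can be sketched with pandas. The data below is a synthetic stand-in with assumed, purely illustrative effect sizes (more study time raising GPA, absences lowering it), not the study's dataset; the heatmap in Figure 6 is simply such a correlation matrix rendered graphically.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the student data; the coefficients below are
# assumed for illustration only.
rng = np.random.default_rng(0)
n = 500
study = rng.uniform(0, 20, n)
absences = rng.integers(0, 30, n).astype(float)
support = rng.integers(0, 5, n).astype(float)
gpa = 2.0 + 0.05 * study - 0.03 * absences + 0.1 * support + rng.normal(0, 0.2, n)

df = pd.DataFrame({"StudyTimeWeekly": study, "Absences": absences,
                   "ParentalSupport": support, "GPA": gpa})

# Pairwise Pearson correlations; a heatmap is this matrix rendered
# graphically (e.g. with matplotlib's imshow).
corr = df.corr()
gpa_corr = corr["GPA"].sort_values(ascending=False)
```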
---
### **Methodology**
**Data Collection and Preprocessing**
The dataset comprises 2,392 student records, covering demographic and academic metrics relevant to student performance. Key features include weekly study time, parental education level, absences, and extracurricular participation. Numeric features were standardized to enhance model performance, categorical features were encoded, and inconsistencies were removed to maintain data integrity.
**Feature Selection**
We identified and prioritized features relevant to academic success, such as weekly study hours, attendance, parental support, and extracurricular involvement. A correlation matrix was developed to assess the relationships between these variables and the target variable, Grade Class.
**Model Development**
We tested and compared seven classification models: Decision Tree, Random Forest, Extra Trees, Gaussian Naive Bayes, SVM, KNN, and Logistic Regression. Models were trained on 80% of the dataset, while the remaining 20% served as the test set. GridSearchCV and cross-validation (10-fold) were used to optimize each model’s performance.
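A sketch of this seven-model comparison using scikit-learn defaults, on a synthetic five-class stand-in for the student dataset (hyperparameter tuning is covered separately and omitted here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class stand-in mirroring the dataset size (2,392 records).
X, y = make_classification(n_samples=2392, n_features=8, n_informative=5,
                           n_classes=5, random_state=42)
# 80/20 train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
    "SVM": SVC(random_state=42),
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
# Fit each model on the training split and score accuracy on the test split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```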
**Evaluation Metrics**
Accuracy, precision, recall, and F1-score were the main metrics used to evaluate model performance, with Decision Tree and Random Forest models yielding the best results. A confusion matrix was also generated to evaluate class-wise accuracy and inform model adjustments.
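These metrics can be computed as sketched below; the data is a synthetic stand-in, and the same `sklearn.metrics` calls apply to the real split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic five-class stand-in; replace with the real feature matrix.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Weighted averages account for the class imbalance in GradeClass.
metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred, average="weighted"),
    "recall": recall_score(y_te, pred, average="weighted"),
    "f1": f1_score(y_te, pred, average="weighted"),
}
cm = confusion_matrix(y_te, pred)  # rows: true class, columns: predicted class
```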
#### Model and Algorithm Descriptions
Each model and algorithm used in this study is outlined below, with detailed descriptions of assumptions and computational complexities.
- **Decision Tree Classifier**:
  - **Overview**: A tree-structured classifier that routes each sample through a sequence of learned feature-threshold tests from the root to a leaf, which assigns the predicted class.
  - **Mathematical Setting**: Recursive binary splits on features to maximize information gain, typically measured by Gini impurity or entropy.
  - **Assumptions**: Assumes the data can be hierarchically divided based on threshold values of features.
  - **Complexity**:
    - **Time**: \(O(n \log n)\), where \(n\) is the number of samples.
    - **Space**: \(O(d \cdot n)\), with \(d\) as the depth of the tree.
- **Random Forest Classifier**:
  - **Overview**: An ensemble that trains many decision trees on bootstrap samples, considering a random subset of features at each split, and predicts by majority vote.
  - **Mathematical Setting**: An ensemble of decision trees, each trained on random subsets of data.
  - **Assumptions**: Assumes approximate independence between trees; overfits less than a single decision tree.
  - **Complexity**:
    - **Time**: \(O(m \cdot n \log n)\), where \(m\) is the number of trees.
    - **Space**: \(O(m \cdot d \cdot n)\).
- **SVM (Support Vector Machine)**:
  - **Overview**: A margin-based classifier that separates classes with the hyperplane farthest from the nearest training points (the support vectors), optionally after a kernel transformation.
  - **Mathematical Setting**: Finds the optimal hyperplane that maximizes the margin between classes.
  - **Assumptions**: Assumes classes are linearly separable, or that a kernel transformation can make them separable.
  - **Complexity**:
    - **Time**: \(O(n^2)\) for standard kernels; computationally intensive for large datasets.
    - **Space**: \(O(n)\) due to support vector storage.
- **Extra Trees Classifier**:
  - **Overview**: An ensemble of extremely randomized trees; unlike Random Forest, split thresholds are drawn at random rather than optimized, further reducing variance.
  - **Assumptions**: Same tree-independence assumption as Random Forest.
  - **Complexity**: **Time**: \(O(m \cdot n \log n)\); **Space**: \(O(m \cdot d \cdot n)\).
- **Gaussian Naive Bayes**:
  - **Overview**: A probabilistic classifier applying Bayes' theorem under the naive assumption of conditional feature independence, modeling each class-conditional likelihood as a Gaussian.
  - **Assumptions**: Features are conditionally independent given the class; continuous features are approximately normally distributed within each class.
  - **Complexity**: **Time**: \(O(n \cdot k)\) for \(k\) features; **Space**: \(O(c \cdot k)\) for \(c\) classes.
- **KNN (K-Nearest Neighbors)**:
  - **Overview**: A lazy learner that classifies a sample by majority vote among its \(k\) nearest training samples.
  - **Assumptions**: Nearby samples share labels; performance is sensitive to feature scaling.
  - **Complexity**: **Time**: \(O(1)\) training, \(O(n \cdot k)\) per prediction; **Space**: \(O(n \cdot k)\) to store the training set.
- **Logistic Regression**:
  - **Overview**: A linear model that estimates class probabilities via the sigmoid (or softmax, for multi-class) function, fitted by maximizing the log-likelihood.
  - **Assumptions**: A linear decision boundary in the feature space.
  - **Complexity**: **Time**: \(O(n \cdot k \cdot i)\) for \(i\) optimizer iterations; **Space**: \(O(c \cdot k)\).
#### Hyperparameter Tuning and Evaluation
All models were optimized using GridSearchCV with 4-fold cross-validation. Hyperparameter ranges included maximum tree depths up to 10 for the Decision Tree, 20–80 estimators for Random Forest and Extra Trees, and several kernel choices for SVM.
**Table 2: Hyperparameters Explored**
| Model | Hyperparameter | Range/Values |
|--------------------|---------------------------|---------------------------------------------|
| Decision Tree | Max Depth | [1,2,3,4,5,6,7,8,9,10] |
| Random Forest | Estimators | [20,30,40,50,80] |
| SVM | Kernel | ['rbf', 'poly', 'sigmoid'] |
| KNN | Neighbors | [3, 5, 7, 10, 15, 20, 40, 50] |
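The grid search described above can be sketched for the Decision Tree row of Table 2; the other models follow the same pattern with their own grids (synthetic data stands in for the real records):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class stand-in for the student dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=5, random_state=0)

# Max Depth grid from Table 2; cv=4 matches the 4-fold setting in this section.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": list(range(1, 11))},
                      cv=4, scoring="accuracy")
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
```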
---
### **Results and Analysis**
---
#### Model Performance Summary
**Table 3: Model Performance Metrics**
| Algorithm | Accuracy | Precision | F1-Score | Recall | CV Score |
|-------------------|----------|-----------|----------|--------|----------|
| RandomForest | 0.939 | 0.939 | 0.938 | 0.939 | 0.925 |
| Extra Trees | 0.889 | 0.888 | 0.887 | 0.889 | 0.868 |
| Decision Tree | 0.850 | 0.855 | 0.851 | 0.850 | 0.917 |
- **Number of Training Runs**: Each model was trained over 10 runs for robustness, using both validation and test splits to avoid overfitting.
- **Evaluation Metrics**: Metrics include accuracy, precision, recall, and F1-score to provide a comprehensive understanding of model performance.
- **Computing Infrastructure**: All experiments were conducted on an NVIDIA GTX 1080 GPU with 16 GB of system RAM.
**Runtime and Energy Cost**
Average runtime per model: Random Forest (2 minutes), Decision Tree (30 seconds), SVM (4 minutes). Estimated energy cost is approximately 1.2 kWh per training session.
**Figure 7: Model Comparison Bar Chart**
Bar chart visualizing accuracy, precision, recall, and F1-scores across algorithms, confirming Random Forest as the top performer.
#### Feature Importance Analysis
**Table 4: Feature Importance Ranking (Random Forest)**
| Feature | Importance Score |
|--------------------|------|
| StudyTimeWeekly | 0.35 |
| ParentalSupport | 0.25 |
| Absences | 0.20 |
| Extracurricular | 0.10 |
| ParentalEducation | 0.10 |
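A sketch of how such a ranking is obtained from a fitted Random Forest. The data here is synthetic and the feature names are reused only as labels, so the resulting scores will not match the reported values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels matching the ranking above; the data itself is synthetic.
feature_names = ["StudyTimeWeekly", "ParentalSupport", "Absences",
                 "Extracurricular", "ParentalEducation"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=4,
                           n_redundant=1, n_classes=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Gini-based importances; scikit-learn normalizes them to sum to 1.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```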
#### Confusion Matrix
**Figure 9: Confusion Matrix for Best Model (Random Forest)**
Displays the distribution of predictions across classes, highlighting areas of misclassification.
The confusion matrix shows the distribution of predictions across the different classes (A, B, C, D, F).
The diagonal elements represent the correct predictions for each class, while the off-diagonal elements represent misclassifications.
The class with the highest number of correctly predicted instances is class F, with 238 correct predictions, consistent with it having by far the largest support. Class D follows with 79 correct predictions, then class C with 72 and class B with roughly 45. Class A has the fewest correct predictions, with 16 instances.
Interpretation:
- The model performs very well overall, with an accuracy close to 94%.
- It is best at identifying class F instances but struggles more with classes A and B, which show the lowest recall.
- This could indicate that the model has learned the patterns of class F more effectively than those of the other classes, or that class F instances are more distinct in the data; the pronounced class imbalance likely also plays a role.
- Further investigation may be needed to understand the varying performance across classes and to identify potential areas for model improvement.
#### ROC Curve and AUC Analysis
**Figure 8: ROC Curve for Multi-Class Classification**
ROC curves for each class with AUC values, showing strong model discriminative ability (AUC ≈ 0.94).
Insights from the ROC curves:
Overall performance:
- Every class achieves an AUC (Area Under Curve) of 0.94, indicating strong discriminative ability across all classes; 0.94 is far closer to a perfect classifier (1.0) than to random guessing (0.5).
Curve characteristics:
- All curves rise sharply at the start: high true positive rates are achieved at low false positive rates, with most classes approaching a 0.8–0.9 true positive rate almost immediately.
- The red dashed diagonal represents random classification (AUC = 0.5); every class curve sits well above it.
Class-wise performance:
- Class 3 (red line) shows the steepest initial rise; class 0 (blue line) is slightly lower in the middle range but catches up eventually; classes 1, 2, and 4 follow very similar patterns.
Model characteristics and practical implications:
- The consistent AUC of 0.94 across classes, with true positive rates above 0.8 at false positive rates as low as 0.1, suggests a good balance between sensitivity and specificity and no significant bias toward any class.
- The model is reliable for multi-class classification, performs far better than chance, and the high AUC suggests suitability for real-world applications.
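The multi-class (one-vs-rest) AUC and per-class ROC curves can be sketched with scikit-learn as follows, on a synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic five-class stand-in for the student dataset.
X, y = make_classification(n_samples=800, n_features=8, n_informative=5,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# Macro-averaged one-vs-rest AUC over all five classes.
auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

# Per-class curve: binarize labels and score one class against the rest.
y_bin = label_binarize(y_te, classes=clf.classes_)
fpr0, tpr0, _ = roc_curve(y_bin[:, 0], proba[:, 0])
```

Plotting each `(fpr, tpr)` pair per class, plus the diagonal reference line, reproduces the layout of the ROC figure.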
**Accuracy and Performance Metrics**
The overall test accuracy of the best fitted model is 93.95%. The macro and weighted averages of precision, recall, and F1-score all fall between roughly 0.89 and 0.94, indicating consistently strong performance.
**Classification Report**
- Highest F1-scores for classes 2.0, 3.0, and 4.0.
- Slightly lower recall for class 0.0, suggesting opportunities for further improvement with techniques like class weighting.
| Class | Precision | Recall | F1-Score | Support |
|--------------|------|------|------|-----|
| 0.0 | 0.94 | 0.76 | 0.84 | 21 |
| 1.0 | 0.94 | 0.83 | 0.88 | 54 |
| 2.0 | 0.96 | 0.92 | 0.94 | 78 |
| 3.0 | 0.92 | 0.95 | 0.93 | 83 |
| 4.0 | 0.94 | 0.98 | 0.96 | 243 |
| accuracy | | | 0.94 | 479 |
| macro avg | 0.94 | 0.89 | 0.91 | 479 |
| weighted avg | 0.94 | 0.94 | 0.94 | 479 |
**Additional Insights**
- **Class-wise performance**: The model performs best on class 4.0 (precision 0.94, recall 0.98, F1-score 0.96) and very well on classes 2.0 and 3.0, with F1-scores above 0.90. It struggles most with class 0.0, whose recall of 0.76 is the lowest across classes.
- **Precision vs. recall**: F1-scores consistently around 0.90 or higher for most classes show a good balance; the model makes accurate predictions without sacrificing many true positives.
- **Overall metrics**: The overall accuracy of 0.94 is excellent, and the weighted averages near 0.94 indicate consistent performance across classes.
- **Potential improvements**: The lower recall for class 0.0 suggests room to improve identification of this class; class weighting or oversampling the underrepresented class 0.0 could help boost performance in this area.
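The class-weighting idea can be sketched with scikit-learn's `class_weight="balanced"` option, on a synthetic imbalanced stand-in where the minority class plays the role of class 0.0 (whether it actually raises that class's recall must be checked on the real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in: 10% minority class mimics the small class 0.0 support.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.1, 0.3, 0.6], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# class_weight="balanced" reweights samples inversely to class frequency.
weighted = RandomForestClassifier(class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te), average=None)
recall_weighted = recall_score(y_te, weighted.predict(X_te), average=None)
```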
---
### **Discussion**
The study reveals that parental support, study habits, and attendance are pivotal factors influencing student success, consistent with existing literature. In the context of Nepal and similar underdeveloped countries, these findings suggest that improvements in family engagement, structured study programs, and extracurricular opportunities can yield substantial benefits in academic performance. Implementing this model across schools, colleges, and EdTech platforms can empower educational stakeholders to identify students in need of support and customize interventions accordingly. Additionally, this framework provides students with clear, data-driven insights into their performance, enabling them to address personal areas of improvement and actively participate in their educational journey.
**Implications for Educational Policy and Institutions**
By identifying the major factors affecting student success, this model enables policymakers to take a proactive approach in designing educational policies that address specific needs within underdeveloped countries. Schools, colleges, and EdTech platforms can use the model to integrate real-time monitoring of student data, enabling dynamic, data-driven decisions that enhance educational outcomes. Additionally, targeted resources can be allocated to areas most affecting performance, potentially reducing dropout rates and fostering greater academic success across the country.
**Key Insights from Feature Importance Analysis**
Parental support, study time, and absences emerged as key factors in predicting academic success. Weak correlations were observed for demographic factors, suggesting that interventions focusing on parental engagement and study habits could be more impactful.
---
### **Conclusion**
This study successfully developed a predictive model for evaluating student performance in underdeveloped countries, achieving high accuracy through Random Forest. Insights highlight parental support and study habits as primary contributors to academic success, offering valuable guidance for stakeholders in the educational sector.
---
### **Future Work**
1. **Data Expansion**: Integrating additional socio-economic indicators across more regions.
2. **Model Enhancements**: Exploring deep learning models for potentially improved accuracy.
3. **EdTech Integration**: Developing API-based tools for EdTech platforms to leverage real-time student assessments.
4. **Feature Enhancement**: Integrating additional factors such as mental health and household income to create a more holistic model.