A Thesis Submitted in Fulfillment of the Requirements for the Degree of Doctor of
Philosophy in Information and Communication Science and Engineering of the Nelson
Mandela African Institution of Science and Technology
Student dropout is among the challenges facing most schools in developing countries,
particularly in Africa. In Tanzania alone, the student dropout rate in secondary schools is
reported to be around 36%. Addressing the problem requires a thorough understanding of the
fundamental factors that cause students to drop out. Several researchers have
identified causes and proposed methods and strategies to reduce or stop
student dropout; however, most of the proposed solutions have not shown promising
results, and the dropout trend continues to increase over time. This study focused on
developing a data-driven approach to identify and predict students who are at risk
of dropping out of school, in order to facilitate intervention programs as an active measure for
eliminating the dropout problem in Tanzania. In doing so, (a) 122 research articles were
examined, (b) 4 focus group discussions and 2 round-table surveys with 38 respondents from
5 districts (Arusha, Mbeya, Kisarawe, Rufiji and Nzega) were conducted, and (c) 3 datasets
from Tanzania and India were used to identify the factors that contribute significantly to
the student dropout problem, determine the best-performing classifier among commonly used
classifiers (Logistic Regression, Random Forest, K-Nearest Neighbor and Multilayer Perceptron), and
assess the effect of data balancing techniques on the predictive performance of the model. Results
revealed that most respondents cited students’ gender, age, parents’ income,
the number of qualified teachers, and remoteness as the main factors contributing to student
dropout in secondary schools. Furthermore, the examined articles
indicated that most studies conducted in developing countries focused on the social aspects of
student dropout, and only a few mentioned other approaches such as machine learning.
Results from developing the data-driven approach show that Logistic
Regression and Multilayer Perceptron achieved the highest performance when an over-sampling
technique was employed. In addition, hyperparameter tuning improved the algorithms'
performance compared with their baseline settings, and stacking the classifiers improved the
overall predictive performance of the developed approach. The study therefore recommends
that relevant authorities consider the developed approach for identifying and predicting
students at risk of dropping out, to support early intervention, planning and informed
decision-making in addressing the student dropout problem.
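
As an illustration of the data balancing step mentioned above, the following is a minimal sketch of random over-sampling in plain Python. It assumes a binary dropout label and a toy two-feature dataset. The function name, the features and the data are hypothetical, and the abstract does not state which over-sampling technique the study used, so this should not be read as the study's implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement; the largest class draws nothing.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: (age, distance_to_school_km); label 1 = dropped out.
rows = [(14, 2), (15, 3), (16, 1), (15, 12), (14, 4), (17, 15)]
labels = [0, 0, 0, 1, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice, over-sampling of this kind is normally applied only to the training split, so that duplicated records do not leak into the data used to evaluate the classifier.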