A Thesis Submitted in Fulfilment of the Requirements for the Degree of Doctor of
Philosophy in Information and Communication Science and Engineering of the Nelson
Mandela African Institution of Science and Technology
Cholera has remained a public health threat throughout history, affecting vulnerable
populations living with unreliable water supplies and sub-standard sanitary conditions.
Studies have also observed that the occurrence of cholera is strongly linked to seasonal
weather patterns. Over the past decades, there have been great achievements in developing
cholera epidemic models, most of which have relied on mathematical techniques. However,
most existing prediction systems face challenges: they lack flexibility, are not user
friendly, are ineffective, and do not integrate essential weather variables. In addition,
advanced technologies such as machine learning (ML) have not been explicitly deployed in
modeling cholera epidemics in developing countries, including Tanzania, because of the
challenges that come with their datasets, such as missing information, data inconsistency,
class imbalance and other uncertainties.
The aim of this work was to overcome the challenges of existing cholera epidemic models,
and to complement those models, by taking advantage of ML techniques. To that end, an ML
model was developed that is capable of predicting cholera epidemic outbreaks in Tanzania
based on their linkage with seasonal weather changes. Secondary datasets from the Tanzania
Meteorological Agency (TMA), the Ministry of Health and Social Welfare, and the Dar es
Salaam Water and Sewerage Authority (DAWASCO) were used. The Adaptive Synthetic Sampling
Approach (ADASYN) and Principal Component Analysis (PCA) were then applied to restore class
balance and reduce the dimensionality of the dataset. To determine which ML algorithms were
best able to predict (yes/no) whether a cholera epidemic would occur given the weather
variables, ten classification algorithms were evaluated using the F1-score, sensitivity and
balanced accuracy metrics.
The Friedman test was then used to determine whether the differences in performance among
the models were statistically significant. Results showed that the Random Forest, Bagging,
and ExtraTree classifiers had the best performance, with 74%, 74.1% and 71.9% accuracy,
respectively. The ensemble method of model fine-tuning was then applied to obtain a single
model from the three, which achieved an overall accuracy of 78.5%. Lastly, a model
evaluation process was performed on the selected final model. The evaluation involved four
processes. The first re-ran the final model on the same dataset but without the weather
variables, and confirmed that the model with weather variables performed better than the
model without them. The second re-ran the model-development procedure using datasets from
the Tanga and Songwe regions in order to illustrate how the adaptive reference model can be
used by other researchers. The third and fourth evaluations followed a mixed-design
approach of quantitative and qualitative methods, using focus group discussions and
interviewer-administered questionnaires with 500 and 20 stakeholders (including medical
officers, epidemiological analysts, nurses, environmental experts, ICT experts and cholera
patients), respectively. In the third evaluation, 90% of the responses agreed that the
developed model is robust and appropriate for use in least developed countries for the
effective prediction of cholera epidemics. The fourth evaluation indicated that the cholera
ML model is better than the cholera statistical models in terms of usability, expandability
and computational complexity.
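
The evaluation metrics named above (sensitivity, balanced accuracy and F1-score) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch for illustration only, not the code used in the thesis: the function name `binary_metrics` and the toy labels are hypothetical, with 1 standing for "epidemic" and 0 for "no epidemic".

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary-classification metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard every ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on epidemics
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-epidemics
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": sensitivity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


# Hypothetical, imbalanced toy labels (3 epidemic cases out of 10).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced toy data like this, plain accuracy comes out higher than balanced accuracy because the majority class dominates it, which is one reason to report balanced accuracy and sensitivity alongside it.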
Overall, the study improved our understanding of the significant role of ML strategies in
health-care data. However, the prediction task could not be treated as a time series
problem because of data collection bias, such as inconsistency in how time was recorded.
The study recommends a review of health-care systems in order to facilitate quality data
collection and further deployment of ML techniques in the health sector in Tanzania.