Anomaly detection on longitudinal data with applications in cloud & healthcare

  • Abdullahi Abubakar

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

For over a decade, analysing longitudinal data to extract useful knowledge has remained a challenge. For instance, as cloud/data centres grow in scale and complexity, effective monitoring and management of the cloud becomes a critical challenge. Competition for shared resources and virtual machine overload are prone to cause anomalies, which can in turn cause downtime and seriously affect the reliability and availability of the entire cloud infrastructure. Given the increasing number of data-rich application fields, including healthcare systems and seismic activity, there is growing research interest in anomaly detection. Despite several attempts to create anomaly detection frameworks for cloud/data centres, developing a framework that properly identifies anomalous Virtual Machines (VMs) in a large-scale, highly dynamic cloud environment remains a difficult task. A framework, the Predictive Ensemble based Anomaly Detection System (PEADS), is developed to monitor virtual machine CPU utilisation (as time-series data) while handling massive quantities of data. It is designed to handle low-latency reads and updates in a linearly scalable and fault-tolerant way. Furthermore, it can differentiate a true drift from an anomaly. Additionally, because it is built on prediction (forecast) algorithms, it can detect anomalies as soon as they occur, which leads to early detection. Early anomaly detection can help prevent potential disasters, such as VM failure in the cloud/data centre, and helps cloud providers plan VM migration. PEADS provides better performance, higher accuracy, and faster analysis than some state-of-the-art anomaly detection systems such as DQR-AD, DeepAnT and VAE-LSTM. This success can be attributed to the proposed algorithm, which reduces false alarms while maintaining accurate detection.
Additionally, the windowing technique and similarity measure played a significant role in its success. More importantly, PEADS is an ensemble system, whereas DQR-AD, DeepAnT and VAE-LSTM are based on single models, which also contributed to its success.
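
The idea of prediction-based detection over a sliding window can be illustrated with a minimal sketch. This is not the PEADS algorithm: the forecaster (a plain window mean), the threshold rule, the function name `detect_anomalies`, and the parameters `window` and `k` are all assumptions chosen for illustration, and the sketch does not attempt to separate drift from anomalies.

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, k=3.0):
    """Return indices whose one-step-ahead forecast error is unusually large.

    Illustrative only: the forecast is the mean of the previous `window`
    points, and a point is flagged when its error exceeds `k` times the
    standard deviation of that window.
    """
    anomalies = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        forecast = mean(history)              # naive forecast from recent values
        spread = stdev(history) or 1e-9       # avoid a zero threshold on flat windows
        if abs(series[t] - forecast) > k * spread:
            anomalies.append(t)
    return anomalies

# Hypothetical CPU utilisation readings (percent) with one spike at index 6.
cpu = [41.0, 43.0, 42.0, 44.0, 43.0, 42.0, 95.0, 43.0, 44.0, 42.0]
print(detect_anomalies(cpu))
```

Because each point is compared with a forecast made before it arrives, a spike can be flagged at the moment it is observed. Note that in this simple version a flagged value still enters later windows and temporarily widens the threshold.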

In addition to anomaly detection on virtual machines in cloud settings, we also used outlier detection techniques to predict Dengue fever outbreaks in Vietnam. Dengue fever is an emerging mosquito-borne infectious disease that affects hundreds of millions of people each year, with considerable morbidity and mortality rates, especially in children and the elderly. The World Health Organization listed Dengue fever (virus) among the top ten diseases responsible for the most global deaths. The spread of Dengue virus has been linked to the interplay between extreme weather events and mosquito dynamics. Therefore, a framework is proposed that utilises meteorological (climate) and Dengue fever vulnerability trends that might be predictive of Dengue fever epidemics at the city (province) level, the regional level, and across the whole of Vietnam. Exploratory data analysis is performed to identify the correlation between Dengue incidence and climate variables using outlier detection and visualisation techniques. Finally, Dengue outbreaks across several provinces of Vietnam are predicted using classification techniques. Seventeen distinct machine learning algorithms are used as base learners to generate the ensemble prediction. These base learners are based on independent procedures and have diverse decision boundaries for predicting Dengue fever outbreaks. Several models are constructed and validated for each algorithm using cross-validation. The models with high training/validation accuracy from each machine learning algorithm are then returned as the final models. These models form the basis for ensemble prediction. For every prediction algorithm, four evaluation metrics are used to ascertain the quality of the prediction: accuracy, balanced accuracy, specificity, and sensitivity. The study period for predicting (forecasting) Dengue fever is a range of time t, from zero to six months in the future.
For every forecast study period, the model is evaluated, and the performance of every prediction by the base models as well as the ensemble model is measured. The ensemble method performs better than the base models. It can successfully forecast a Dengue fever outbreak six months in advance with 90% accuracy, although with a modest decrease in accuracy compared to short-term forecasting.
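
The four evaluation metrics named above can be computed from a binary confusion matrix, as in the following sketch. The abstract does not state how the base learners' outputs are combined, so the majority vote used here is an assumption, as are the function names `majority_vote` and `evaluate` and the toy labels (1 = outbreak, 0 = no outbreak), which are not thesis data.

```python
def majority_vote(predictions):
    """Combine binary predictions (one list per model) by strict majority."""
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]

def evaluate(y_true, y_pred):
    """Compute accuracy, balanced accuracy, sensitivity and specificity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Toy labels and three hypothetical base-model outputs.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
base_predictions = [
    [1, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 1],
]
ensemble = majority_vote(base_predictions)
metrics = evaluate(y_true, ensemble)
print(ensemble, metrics)
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when outbreak periods are much rarer than non-outbreak periods.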

Thesis is embargoed until 31 July 2024.


Date of Award: Jul 2023
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Petroleum Technology Development Fund
Supervisors: Thai Son Mai, Peter Kilpatrick & Dimitrios Nikolopoulos

Keywords

  • Anomaly detection
  • time series
  • virtual machines
  • dengue fever
  • parallel computing
  • ensemble methods
  • climate variable
  • longitudinal data
  • cloud computing
  • prediction
  • time-series forecasting
  • concept drift
