Predictive analysis on aircraft data

Classification of aircraft maintenance status with a distributed system.

A distributed decision tree classifier built with Spark that predicts, from the KPIs of a given aircraft, whether it will soon need maintenance. The process is split into three separate pipelines.

Data management pipeline

The system reads data from several sources, including CSV text files, remote databases and our own data warehouse on the Hadoop Distributed File System (HDFS). The data is then filtered and aggregated using Spark transformations and written to a single text file with a dataframe-like structure that contains only the relevant attributes.
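
The filter-and-aggregate step can be illustrated without a cluster. The sketch below is plain Python, not the project's Spark code: the record layout, the aircraft identifiers and the `flight_hours` attribute are all hypothetical, and the loop stands in for what a Spark `filter` followed by a per-key reduction would do.

```python
from collections import defaultdict

# Hypothetical raw records: (aircraft_id, date, kpi_name, value).
# None marks a missing reading.
RAW = [
    ("XY-AAA", "2015-03-01", "flight_hours", 5.0),
    ("XY-AAA", "2015-03-01", "flight_hours", 3.0),
    ("XY-AAA", "2015-03-01", "cabin_temp", 21.5),    # not a relevant attribute
    ("XY-BBB", "2015-03-01", "flight_hours", None),  # missing value, dropped
    ("XY-BBB", "2015-03-02", "flight_hours", 7.5),
]

# Assumption: only this attribute matters for the classifier.
RELEVANT = {"flight_hours"}


def filter_and_aggregate(records, relevant=RELEVANT):
    """Keep relevant, non-missing readings and sum them per (aircraft, date, kpi)."""
    totals = defaultdict(float)
    for aircraft, date, kpi, value in records:
        if kpi in relevant and value is not None:   # the "filter" step
            totals[(aircraft, date, kpi)] += value  # the per-key aggregation
    return dict(totals)


def to_lines(totals):
    """Render the aggregates as sorted comma-separated lines, one row per key."""
    return [f"{a},{d},{k},{v}" for (a, d, k), v in sorted(totals.items())]


lines = to_lines(filter_and_aggregate(RAW))
for line in lines:
    print(line)
```

Each output line is one row of the resulting table, which is the shape the later pipelines would read back in.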

Data analysis pipeline

A random split is then performed to obtain train and test partitions, and the decision tree classifier from MLlib is trained on the train partition. The trained model is saved for later use and evaluated on the test partition to estimate its accuracy.
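
The split, train and evaluate sequence can be sketched in plain Python. This is an illustration only: a one-threshold decision stump stands in for the MLlib decision tree, the single `hours` feature and the 70/30 split are assumptions, and the split here yields exact sizes whereas Spark's `randomSplit` yields approximate ones.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_stump(train_rows):
    """Pick the threshold that misclassifies the fewest training rows.

    A stump is a depth-1 decision tree; it stands in for the real classifier.
    """
    best_threshold, best_errors = None, len(train_rows) + 1
    for threshold, _ in train_rows:
        errors = sum(1 for x, y in train_rows if (x >= threshold) != y)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda x: x >= best_threshold


def accuracy(predict, test_rows):
    """Fraction of held-out rows whose predicted label matches the true one."""
    correct = sum(1 for x, y in test_rows if predict(x) == y)
    return correct / len(test_rows)


# Hypothetical data: (hours since last check, needs maintenance soon).
rows = [(float(h), h >= 50) for h in range(0, 100, 5)]

train, test = random_split(rows)
model = train_stump(train)
acc = accuracy(model, test)
print(f"train={len(train)} test={len(test)} accuracy={acc:.2f}")
```

Because the test rows are never seen during training, the accuracy printed at the end is an estimate of how the model behaves on new data rather than a measure of fit.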

Run-time Classifier pipeline

When new data arrives, this pipeline applies preprocessing similar to that of the data management pipeline, loads the previously trained classifier and uses it to classify the new data. The result is returned to the user.
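
The save, load and classify flow can be sketched as follows. Everything here is hypothetical: the model is a one-threshold rule stored as JSON rather than a saved MLlib model, and the file name, the `flight_hours` attribute and the threshold are made up for illustration.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Persist the model parameters so a later run can reuse them."""
    with open(path, "w") as fh:
        json.dump(model, fh)


def load_model(path):
    """Read back the parameters written by save_model."""
    with open(path) as fh:
        return json.load(fh)


def preprocess(record, feature):
    """Keep only the attribute the model was trained on, as a float."""
    return {feature: float(record[feature])}


def classify(model, record):
    """Return True when the record is predicted to need maintenance soon."""
    kpis = preprocess(record, model["feature"])
    return kpis[model["feature"]] >= model["threshold"]


# Training side: store a (hypothetical) trained model.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model({"feature": "flight_hours", "threshold": 50.0}, model_path)

# Run-time side: load the model and classify a newly arrived record.
loaded = load_model(model_path)
new_record = {"aircraft": "XY-AAA", "flight_hours": "62.5", "cabin_temp": "21.0"}
needs_maintenance = classify(loaded, new_record)
print("maintenance" if needs_maintenance else "no maintenance")
```

Reusing the same `preprocess` function at training time and at run time keeps the attributes fed to the model consistent between the two pipelines.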