Predicting the Functionality of Water Pumps with XGBoost

Photo by Kelly: https://www.pexels.com/photo/close-up-of-a-child-s-hands-catching-water-from-the-spout-of-a-water-pump-3030281/

Table of Contents

Introduction
Objective
Tools/Frameworks
Exploratory Data Analysis
Feature Engineering
Creating Training and Testing Splits
Determining the Evaluation Metric
Creating Baseline Models
Data Modeling Approach
Hyperparameter Tuning Approach
XGBoost Model
CatBoost Model
LightGBM Model
Selecting the Best Model
Model Interpretation
Model Deployment
Limitations
Conclusion
References

Introduction

Note: This project is inspired by the Pump it Up: Data Mining the Water Table competition hosted by DrivenData.

Tanzania currently suffers from a severe water crisis, with 28 percent of the population lacking access to safe water. One feasible way to combat this crisis is to ensure that the water pumps installed across the country remain functional.

Using the data procured by Taarifa, which aggregates data from the Tanzania Ministry of Water, there is an opportunity to leverage machine learning to detect water pumps that are non-functional or need repair.


Objective

The aim of this project is to train and deploy a machine learning model that predicts whether a water pump is functional, non-functional, or functional but needs repair.


Tools/Frameworks

This project requires the use of various tools and frameworks.

The scripts that facilitate data analysis and modeling are all written in Python.

Data preprocessing and feature engineering are carried out with the pandas and scikit-learn modules. Data modeling is conducted with a combination of scikit-learn and other machine learning libraries.

The final model is incorporated in a web application built with the Streamlit library. This application is then deployed using Heroku.

For a more comprehensive understanding of the dependencies of the project, please visit the GitHub repository.


Exploratory Data Analysis

Performing exploratory data analysis (EDA) will shed light on the makeup of the dataset, what processes the data should be subject to, and which machine learning algorithms should be considered.

The provided data comprises 59400 data points and 41 features, including the target label.

The 41 features are the following:

Code Output (Created By Author)

Note: For information on each of these features, please visit the problem description for the competition.

The status_group feature will serve as the target label for the project. It reveals whether the water pump is functional, non-functional, or functional but needs repair.

As shown in the code output, the data is predominantly composed of categorical features.

Also, many of the features report similar information. For instance, the features latitude, longitude, region, and region_code all show the location of the water pumps. Including all of these features would be unnecessary and can even hamper the models' performance.

Furthermore, several features in the dataset have missing values.

Missing Values (Created By Author)

Finally, the distribution of values in the target label indicates that the data is imbalanced, with the class "functional needs repair" being underrepresented.

Target Label (Created By Author)

Feature Engineering

From the results of the EDA, it is clear that many of the features need to be removed or modified prior to any modeling.

  1. Remove irrelevant columns

The id, recorded_by, and wpt_name features are removed as they have no influence over the target label.

2. Remove redundant columns

Features that include redundant information should also be removed. These features include: subvillage, latitude, longitude, region_code, district_code, lga, ward, scheme_name, extraction_type, extraction_type_group, payment, water_quality, quantity, source, source_type, waterpoint_type, and management.

3. Create the "age" feature

On their own, the construction_year and date_recorded features say little about the status of a water pump. However, combining them yields an "age" feature, i.e., the number of years between a pump's construction and the date its record was taken.
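A minimal sketch of this derivation, assuming the raw data has been loaded into a pandas DataFrame named df (in this dataset, construction_year is recorded as 0 when the year is unknown):

```python
import pandas as pd

# Assumed: df is the raw DataFrame containing the original columns.
df["date_recorded"] = pd.to_datetime(df["date_recorded"])

# Age = years elapsed between construction and the date the record was taken.
df["age"] = df["date_recorded"].dt.year - df["construction_year"]

# A construction_year of 0 marks a missing year, so the derived age is treated as missing too.
df.loc[df["construction_year"] == 0, "age"] = pd.NA

# The two source columns can be dropped once the age feature exists.
df = df.drop(columns=["construction_year", "date_recorded"])
```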

4. Remove weak predictors

Finally, predictors that do not have a strong enough relationship with the target label should be removed.

The relationships between the numerical features and the target label are evaluated using ANOVA. The following code snippet creates a graph that displays the p-value for each feature.
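A sketch of that computation (the original snippet lives in the repository); here df is assumed to be the engineered DataFrame with status_group still attached, and the 0.05 significance threshold is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import f_oneway

# Assumed: df holds the engineered features plus the status_group target label.
y = df["status_group"]
numeric_cols = df.drop(columns="status_group").select_dtypes(include="number").columns

p_values = {}
for col in numeric_cols:
    # Split the feature's values by target class and run a one-way ANOVA across the groups.
    groups = [df.loc[y == cls, col].dropna() for cls in y.unique()]
    p_values[col] = f_oneway(*groups).pvalue

# Features with p-values above the significance threshold are candidates for removal.
pd.Series(p_values).sort_values().plot(kind="barh")
plt.axvline(0.05, color="red", linestyle="--")
plt.xlabel("p-value")
plt.show()
```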

Code Output (Created By Author)

The relationships between the categorical features and the target label are evaluated using the chi-square test of independence. The following code snippet creates a graph that displays the p-value for each feature.
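Again as a sketch under the same assumptions (df with status_group attached, 0.05 threshold):

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed: df holds the engineered features plus the status_group target label.
categorical_cols = df.drop(columns="status_group").select_dtypes(include="object").columns

p_values = {}
for col in categorical_cols:
    # Build a contingency table of feature categories vs. target classes.
    contingency = pd.crosstab(df[col], df["status_group"])
    p_values[col] = chi2_contingency(contingency)[1]

# Features with p-values above the significance threshold are candidates for removal.
pd.Series(p_values).sort_values().plot(kind="barh")
plt.axvline(0.05, color="red", linestyle="--")
plt.xlabel("p-value")
plt.show()
```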

Code Output (Created By Author)

Out of the tested numeric and categorical features, only the num_private feature will be removed due to its high p-value.

After the feature selection process, the dataset has shrunk from 41 features to 18.


Creating Training and Testing Splits

The original dataset is split into training and testing sets with a stratified split, which ensures that the classes in the target label appear in the same proportions in both splits.
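In scikit-learn, this can be done with the stratify argument of train_test_split (the 80/20 split ratio and random seed below are assumptions):

```python
from sklearn.model_selection import train_test_split

# Assumed: df holds the engineered features plus the status_group target label.
X = df.drop(columns="status_group")
y = df["status_group"]

# stratify=y keeps the class proportions identical in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```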


Determining the Evaluation Metric

The data is prepared for modeling, but it is important to first determine what evaluation metric is most suited for this project.

To do so, let's consider the priorities of the end users.

Created By Author

The machine learning solution should increase the accessibility to clean water by detecting water pumps that are either non-functional or need repair. The solution should also limit the money and resources expended by correctly identifying water pumps that do not need repair or replacement.

It's worth noting that false predictions are highly undesirable.

Failing to correctly identify water pumps needing repair or replacement (i.e., a false negative) will reduce access to clean water. Residents relying on these water pumps will be unable to use water for agriculture and sanitation purposes, subjecting them to a lower standard of living. Moreover, the governments and/or organizations that constructed the water pumps will suffer reputational damage.

On the other hand, failing to correctly identify water pumps that are functional (i.e., a false positive) is also an undesirable outcome. It would result in scarce money and resources being spent on water pumps that do not need repair or replacement.

Given the substantial costs of false positives and false negatives, the machine learning models should consider both precision and recall metrics. However, since the false negatives appear to be more consequential, more emphasis should be placed on getting a higher recall.

Thus, the evaluation metric used for this project is the F2-score, which considers both precision and recall but gives greater weight to recall.

F2-score Formula
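In code, the metric can be expressed as a scikit-learn scorer (a sketch; the weighted averaging over the three classes is an assumption):

```python
from sklearn.metrics import fbeta_score, make_scorer

# F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall);
# with beta = 2, recall counts more heavily than precision.
# average="weighted" (an assumption) aggregates the per-class scores.
f2_scorer = make_scorer(fbeta_score, beta=2, average="weighted")
```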

Creating Baseline Models

A baseline will help contextualize the results of the machine learning model. This project will utilize two baseline models as reference: a dummy classifier and a logistic regression.

The dummy classifier will always make random predictions for the water pumps' functionality.

A logistic regression with default hyperparameters is trained after the categorical features and missing data have been encoded and imputed, respectively.

The role of the logistic regression is to show how a simple model will perform with the available data. If the logistic regression does not outperform the dummy classifier, it will signal issues with the data.
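A sketch of both baselines (the preprocessing details and the weighted F2 averaging are assumptions; the exact snippet is in the repository):

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Dummy baseline: uniformly random predictions.
dummy = DummyClassifier(strategy="uniform", random_state=42)
dummy.fit(X_train, y_train)
f2_dummy = fbeta_score(y_test, dummy.predict(X_test), beta=2, average="weighted")

# Logistic regression baseline: impute missing values and one-hot encode categoricals first.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_include="object")),
    (SimpleImputer(strategy="median"), make_column_selector(dtype_include="number")),
)
log_reg = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
log_reg.fit(X_train, y_train)
f2_log_reg = fbeta_score(y_test, log_reg.predict(X_test), beta=2, average="weighted")

print(f"Dummy classifier F2-score: {f2_dummy:.3f}")
print(f"Logistic regression F2-score: {f2_log_reg:.3f}")
```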

Code Output (Created By Author)

As shown in the output, the logistic regression yields a much higher F2-score than the dummy classifier, which suggests that the data has enough signal.


Data Modeling Approach

The procedure for building the models can be captured in the following flowchart:

Data Modeling Flowchart (Created By Author)

The three models that will be considered for the project are the CatBoost classifier, the LightGBM classifier, and the XGBoost classifier. All of these classifiers incorporate ensemble learning, which is well-suited for imbalanced data. Furthermore, they provide support for categorical features and/or missing data.

For each of these models, the optimal hyperparameter set is determined. The models are then trained with the hyperparameters and are evaluated with the testing set.

Once every model is tested, the best model (i.e., the model with the highest F2-score) will be selected. This model will be used in the web application.


Hyperparameter Tuning Approach

The hyperparameter tuning approach itself comprises many key techniques, so it is worth breaking down with another flowchart.

Hyperparameter Tuning Flowchart (Created By Author)

The hyperparameter tuning will be executed using the Optuna library.

The procedure entails creating an Optuna study for each classifier, in which the classifier is trained and evaluated with 100 hyperparameter sets. Each hyperparameter set is evaluated with stratified cross-validation, which entails splitting the training data into multiple folds; for each fold, a model is trained on the remaining folds and validated on the held-out fold.

Each hyperparameter set will be measured by the average F2-score of the models that are trained with it. The hyperparameters that yield the highest F2-score will be deemed the best hyperparameter set.


XGBoost Model

To demonstrate the data modeling and hyperparameter tuning carried out for each classifier, the procedure for training and evaluating the XGBoost model is shown in the following code snippets.

First, an Optuna study is run to find the optimal hyperparameters for the XGBoost classifier.
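A sketch of such a study (the search space, encoding choices, and weighted F2 averaging are assumptions; the full snippet is in the repository):

```python
import numpy as np
import optuna
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects numeric class labels, so the target is label-encoded.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

f2_scorer = make_scorer(fbeta_score, beta=2, average="weighted")

def objective(trial):
    # Illustrative search space; the actual ranges are defined in the repository.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    # Assumed: X_train has already been encoded numerically.
    model = XGBClassifier(**params)
    # Each hyperparameter set is scored by its mean F2 across stratified folds.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return np.mean(cross_val_score(model, X_train, y_train_enc, scoring=f2_scorer, cv=cv))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```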

The study will identify the best hyperparameter combination. These hyperparameters are then used to train an XGBoost classifier, which is then evaluated with the testing set.
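Continuing the sketch above, the best parameters reported by the study are used to fit a final classifier, which is then scored on the testing set:

```python
# Train the final XGBoost classifier with the best hyperparameters found by the study.
best_xgb = XGBClassifier(**study.best_params)
best_xgb.fit(X_train, y_train_enc)

# Evaluate the tuned model on the held-out testing set with the F2-score.
y_test_enc = le.transform(y_test)
y_pred = best_xgb.predict(X_test)
print(fbeta_score(y_test_enc, y_pred, beta=2, average="weighted"))
```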

Code Output (Created By Author)

CatBoost Model

With the procedure used for the XGBoost classifier, a CatBoost classifier is trained and evaluated against the testing set (for the entire codebase, please visit the GitHub repository).

Code Output (Created By Author)

LightGBM Model

With the procedure used for the XGBoost classifier, a LightGBM classifier is trained and evaluated against the testing set (for the entire codebase, please visit the GitHub repository).

Code Output (Created By Author)

Selecting the Best Model

The performances of all of the models are captured in the following table.

Performance of Each Model (Created by Author)

Since the XGBoost classifier yields the highest F2-score (≈0.80), it is deemed the best model.


Model Interpretation

The performance of the XGBoost model can be contextualized with a classification report and confusion matrix that compares the predicted values to the actual values.
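Both can be produced with scikit-learn (a sketch that reuses the tuned model and label encoder from the XGBoost section above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Assumed: best_xgb, le, X_test, and y_test_enc come from the modeling step above.
y_pred = best_xgb.predict(X_test)

# Per-class precision, recall, and F1, with the class names restored from the label encoder.
print(classification_report(y_test_enc, y_pred, target_names=le.classes_))

# Confusion matrix of predicted vs. actual classes.
ConfusionMatrixDisplay.from_predictions(y_test_enc, y_pred)
plt.show()
```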

Classification Report (Created By Author)
Confusion Matrix (Created By Author)

As shown in the classification report and confusion matrix, the precision and recall for functional and non-functional water pumps are relatively high. However, the model does not perform as well with water pumps that are functional but need repair.


Model Deployment

Now that the modeling process is complete, the model should be deployed into a web application that can be accessed by end users.

The web app is built with the Streamlit library in a file named app.py. The underlying code for this file is shown below:
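Since the full script is in the repository, here is a trimmed-down sketch of what app.py could look like; the model.pkl file name and the handful of input widgets shown are illustrative assumptions:

```python
import pickle

import pandas as pd
import streamlit as st

# Assumed: the tuned XGBoost pipeline was serialized to model.pkl during training.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Water Pump Functionality Predictor")

# Sidebar widgets for the water pump parameters (only a few features shown here).
st.sidebar.header("Water Pump Parameters")
age = st.sidebar.number_input("Age (years)", min_value=0, max_value=100, value=10)
gps_height = st.sidebar.number_input("GPS height", value=1000)
basin = st.sidebar.selectbox("Basin", ["Lake Victoria", "Pangani", "Rufiji"])
quantity_group = st.sidebar.selectbox("Quantity", ["enough", "insufficient", "dry", "seasonal"])

if st.sidebar.button("Predict Water Pump Condition"):
    # Assemble the inputs into a single-row DataFrame matching the training features.
    pump = pd.DataFrame([{"age": age, "gps_height": gps_height,
                          "basin": basin, "quantity_group": quantity_group}])
    prediction = model.predict(pump)[0]
    st.write(f"Predicted condition: {prediction}")
```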

When run with the streamlit run app.py command, the app should look like the following:

Streamlit App (Created By Author)

The app contains a sidebar in which users can input the parameters of the water pump of interest. After clicking the "Predict Water Pump Condition" button, the XGBoost model predicts whether a water pump with the selected features is functional, non-functional, or functional but needs repair. The result is displayed in the application.

Making a Prediction (Created By Author)

This web application has also been hosted using Heroku, so you can access it yourself by clicking the link below:

Streamlit


Limitations

While the project has resulted in a functional web application, it has been subject to certain limitations that are worth noting.

1. No existing solution as reference

While the proposed solution does enable users to determine the functionality of water pumps in Tanzania, it would be difficult to pitch it to clients since there is no existing solution that can serve as a reference. Thus, it is difficult to determine how much money the model would save and to what extent it would improve accessibility to water.

2. Limited knowledge of constraints

The project is conducted with the assumption that a false negative (i.e., identifying a non-functional water pump as functional) is more undesirable than a false positive (i.e., identifying a functional water pump as non-functional). However, this assumption is only valid if there are no significant limitations in terms of money and resources available for repairing and replacing water pumps.

Unfortunately, without a clear understanding of such constraints, it is not possible to determine the most fitting evaluation metrics for the machine learning models.

3. Lack of domain knowledge

There were many unique values in the categorical features in this dataset. However, DrivenData provided no explanation with regard to what these values represent. As a result, the project lacked evidence-based strategies for processing the categorical features.


Conclusion

Photo by Alexas_Fotos on Unsplash

Overall, the project aimed to leverage the data collected by Taarifa to train a machine learning model that predicts the functionality of water pumps and incorporate it in an application that has business value.

For access to the entire codebase, please visit the GitHub repository:

GitHub – anair123/Detecting-Faulty-Water-Pumps-With-Machine-Learning

Thank you for reading!

References

Bull, P., Slavitt, I., & Lipstein, G. (2016, June 24). Harnessing the power of the crowd to increase capacity for Data Science in the Social Sector. arXiv.org. https://arxiv.org/abs/1606.07781
