Tanzanian Water Well Classification

https://github.com/MullerAC/tanzanian-water-well-classification

Overview

Tanzania is a developing country that struggles to get clean water to its population of 57 million. As part of an ongoing competition at DrivenData, this project's goal was to predict the functionality of water wells given data provided by Taarifa and the Tanzanian Ministry of Water.

This is a ternary classification problem: all points are either functional, nonfunctional, or functional and in need of repair. The data has 39 independent variables relating to the management, use, and location of the pumps. 59,400 labeled data points were provided for model creation, with an additional 14,850 unlabeled points to predict on for contest submissions.

As stated by the competition rules, accuracy was chosen as the main scoring metric of the models. If this model were to be used in real life, predicting non-functioning water sources as fully functional would be much more disastrous than the reverse, so a metric that takes false negatives (non-functioning wells that the model misses) into account, such as recall or F1 score, would be more appropriate.
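
To make the point about false negatives concrete, here is a minimal sketch of per-class recall with scikit-learn. The labels and predictions are made-up toy values, not results from this project, and the class label strings are assumptions chosen for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels and predictions (illustrative only, not project results).
y_true = ["functional"] * 6 + ["non functional"] * 4
y_pred = ["functional"] * 6 + ["functional", "functional", "non functional", "non functional"]

classes = ["functional", "non functional"]
accuracy = accuracy_score(y_true, y_pred)
# average=None returns one recall value per class, in the order given by `labels`.
per_class_recall = recall_score(y_true, y_pred, labels=classes, average=None)
recall_by_class = dict(zip(classes, per_class_recall))

print(f"accuracy: {accuracy:.2f}")
for name, value in recall_by_class.items():
    print(f"recall[{name}]: {value:.2f}")
```

In this toy case accuracy is 80%, yet only half of the broken wells are caught, which is the kind of error that accuracy alone hides.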

Exploratory Data Analysis

A look at the 19 categorical variables revealed that many of them have thousands of categories. One-hot encoding that many categories would add thousands of sparse columns, so categories with less than 1% representation in the overall dataset are binned into an “other” category. Missing values in these variables are also put into the “other” category. Reducing the number of features related to the pump’s geographical location leaves 13 categorical variables. Creating dummy features for these categories leaves 100 total independent variables in the data.

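import pandas as pd

# Fold placeholder values into the 'other' category
df['funder'] = df['funder'].map(lambda x: 'other' if x == '0' else x)
df['installer'] = df['installer'].map(lambda x: 'other' if x == '0' else x)
df['scheme_management'] = df['scheme_management'].map(lambda x: 'other' if x == 'Other' else x)

# Mark unneeded features for removal (to_drop, categorical, and target are lists defined earlier)
for item in ['subvillage', 'region_code', 'district_code', 'lga', 'ward', 'scheme_name']:
    to_drop.append(item)
    categorical.remove(item)

# One-hot encode the remaining categorical variables
df_dummies = pd.get_dummies(df.drop(to_drop + target, axis=1))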
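
The snippet above does not show the 1% binning step itself. A minimal sketch of how it could be done follows; the helper name `bin_rare_categories` and its parameters are hypothetical, and the toy column is made up.

```python
import pandas as pd

def bin_rare_categories(series, min_frac=0.01, other_label="other"):
    """Replace missing values and categories below `min_frac` of rows with `other_label`."""
    filled = series.fillna(other_label)
    freqs = filled.value_counts(normalize=True)
    rare = freqs[freqs < min_frac].index
    return filled.where(~filled.isin(rare), other_label)

# Toy column of 200 rows: 'a' and 'b' are common, 'c' appears once (0.5%), one value is missing.
toy = pd.Series(["a"] * 120 + ["b"] * 78 + ["c"] + [None])
binned = bin_rare_categories(toy)
print(binned.value_counts().to_dict())
```

Applying a helper like this to each categorical column before calling `pd.get_dummies` keeps the number of dummy columns manageable.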

The continuous variables had some missing data that needed to be dealt with. In order to run KNN imputation on the dataset, the missing data was changed to NaN values and the remaining data was min-max scaled. Carrying out the KNN imputation leaves 100 columns of data scaled from 0 to 1, which are then combined back with the dependent variable to create our cleaned data.

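from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Min-max scale first, then fill the remaining NaN values with KNN imputation
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_dummies), index=df_dummies.index, columns=df_dummies.columns)
imputer = KNNImputer()
df_imputed = pd.DataFrame(imputer.fit_transform(df_scaled), index=df_scaled.index, columns=df_scaled.columns)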
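
The step that turns missing data into NaN values is not shown above. Here is a small sketch of the whole sequence on a toy frame. Treating 0 as the "missing" placeholder in these two columns is an assumption made for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy frame; treating 0 as "missing" in these columns is an assumption.
toy = pd.DataFrame({
    "construction_year": [1995, 0, 2005, 2010],
    "population": [100, 200, 0, 400],
})
toy = toy.replace(0, np.nan)

# MinMaxScaler ignores NaN when fitting and leaves it in place when transforming.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Each NaN is filled with the mean of that column over the nearest rows.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(scaled), columns=toy.columns)
print(imputed)
```

Scaling before imputing matters here because KNN distances would otherwise be dominated by whichever column has the largest raw range.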

The same process is applied to the submission data: removing unneeded features, changing underrepresented categories to ‘other’, creating dummy columns, scaling, and imputing missing values. This data is then ready to be used with the models that we build from the labeled data.
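
One practical detail when repeating the process is that the submission data must end up with exactly the same dummy columns as the labeled data. A possible way to guarantee this, sketched on made-up toy frames (the column values are illustrative), is to reindex against the training columns:

```python
import pandas as pd

# Toy labeled and submission frames (made-up values).
train = pd.DataFrame({"basin": ["rufiji", "pangani", "other"], "gps_height": [10, 20, 30]})
submission = pd.DataFrame({"basin": ["pangani", "pangani"], "gps_height": [15, 25]})

train_dummies = pd.get_dummies(train)
# Reindex to the training columns: dummies unseen in training are dropped,
# and training dummies absent from the submission are filled with 0.
submission_dummies = pd.get_dummies(submission).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(submission_dummies.columns))
```

Without this alignment, a category that never appears in the submission data would leave the model one column short at prediction time.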

Baseline Models

The cleaned data is split into train data (75%) and test data, and baseline models with default parameters are fit on the train data using many different model types. Each model then predicts the test data, and the resulting accuracy is used to determine which models are best.

  • Logistic Regression: 73.40%
  • K Nearest Neighbors: 77.98%
  • Naive Bayes: 54.23%
  • Decision Tree: 74.64%
  • Bagged Trees: 78.38%
  • Random Forest: 79.30%
  • Adaboost: 72.74%
  • Gradient Boost: 74.99%
  • XGBoost: 74.42%
  • Support Vector Machines: 77.04%

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def get_metrics(y_test, X_test, model):
    """Return accuracy plus weighted F1, precision, and recall for a fitted model."""
    labels = y_test.to_numpy()
    preds = model.predict(X_test)

    metrics = {}
    metrics['accuracy'] = accuracy_score(labels, preds)
    metrics['f1'] = f1_score(labels, preds, average='weighted')
    metrics['precision'] = precision_score(labels, preds, average='weighted')
    metrics['recall'] = recall_score(labels, preds, average='weighted')

    return metrics

logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
logreg.fit(X_train, y_train)
get_metrics(y_test, X_test, logreg)
# repeat for other models
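
Rather than repeating the fit-and-score steps by hand, the baselines can be collected in a dictionary and looped over. The following is a sketch only: synthetic data from `make_classification` stands in for the cleaned well data, and just three of the model types are included.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data stands in for the cleaned well data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

baselines = {
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each baseline on the train split and score it on the test split.
scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```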

Of these, hyperparameter tuning will be performed on KNN, Random Forest, XGBoost, and SVM. XGBoost did not perform well, but is more sensitive to parameter tuning, so it shouldn’t be discounted yet.

Final Models

GridSearchCV was used on the KNN, Random Forest, XGBoost, and SVM models. A second pass was also made, narrowing in on the best parameters. XGBoost saw the most improvement upon tuning, while KNN and SVM have few hyperparameters to tune and saw little improvement. Bagging the models slightly improved the Random Forest and SVM models, but decreased the accuracy of the KNN and XGBoost models. The accuracy of these improved models was measured against the same test data as the baseline.

  • KNN: 78.57% (improvement of 0.59 percentage points)
  • Random Forest: 80.06% (improvement of 0.76 percentage points)
  • XGBoost: 80.25% (improvement of 5.83 percentage points)
  • SVM: 77.70% (improvement of 0.66 percentage points)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 100, 500],  # default 100
    'max_depth': [None, 10],  # default None
    # 'sqrt' (named 'auto' in older scikit-learn versions) = sqrt(# of features) = 10; None = all 100 features
    'max_features': ['sqrt', 50, None]
}
forest = RandomForestClassifier()
grid_search = GridSearchCV(forest, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)
grid_search.best_params_
# this can be repeated with values narrowing in on the best parameters
# repeat for other models
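
The bagging step mentioned above is not shown in the snippet. A minimal sketch of bagging one tuned model with scikit-learn's `BaggingClassifier` follows. The data is synthetic, and `n_neighbors=7` is an arbitrary placeholder, not a value found by the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data stands in for the cleaned well data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Stand-in for a tuned model; n_neighbors=7 is an arbitrary placeholder value.
tuned_knn = KNeighborsClassifier(n_neighbors=7)
# The base model is passed positionally because its keyword name differs across scikit-learn versions.
bagged_knn = BaggingClassifier(tuned_knn, n_estimators=10, random_state=0)

# Compare the single model against the bagged ensemble on the same test split.
single_acc = tuned_knn.fit(X_train, y_train).score(X_test, y_test)
bagged_acc = bagged_knn.fit(X_train, y_train).score(X_test, y_test)
print(f"single: {single_acc:.2%}, bagged: {bagged_acc:.2%}")
```

Comparing the two scores on the same test split shows whether bagging helps or hurts a given model.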

Contest Submissions

Using the best hyperparameters from the grid search, each model is fit again, this time on the entire labeled dataset rather than just the 75% used as train data. For the contest, the whole labeled dataset acts as the train set and the submission data as the test set. Predictions on the submission data are then submitted to the contest, which returns only the accuracy of each submission.
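
The refit-and-predict step might look like the sketch below. Everything here is a stand-in: the data is synthetic, the hyperparameters are placeholders for whatever `grid_search.best_params_` returned, the row ids are invented, and the `id` and `status_group` column names are assumptions about the submission format.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: 500 "labeled" rows and 100 "submission" rows.
X_all, y_all = make_classification(n_samples=600, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
X_labeled, y_labeled = X_all[:500], y_all[:500]
X_submission = X_all[500:]
submission_ids = list(range(1000, 1100))  # hypothetical row ids

# Placeholder hyperparameters; in practice these come from grid_search.best_params_.
best_params = {"n_estimators": 100, "max_depth": 10}
final_model = RandomForestClassifier(random_state=0, **best_params)
final_model.fit(X_labeled, y_labeled)  # fit on every labeled row, no hold-out

submission = pd.DataFrame({
    "id": submission_ids,
    "status_group": final_model.predict(X_submission),
})
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text.splitlines()[0])
```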

  • KNN: 79.99%
  • Random Forest: 81.49%
  • XGBoost: 81.50%
  • SVM: 78.04%

Conclusions

XGBoost performed the best of any model, although it only barely beat out the Random Forest. The comparatively poor performance of the KNN and SVM models suggests that the data is not easily separable, as these models are both distance-based.

The highest accuracy, 81.50%, placed 1303rd out of 10,458 in the contest rankings.

The Random Forest produced fewer false positives than XGBoost, and so is likely the better choice in real-life situations.

Future Improvements

When cleaning the data, the scaling and imputing were done before the train-test split. This is appropriate for the submission models, but it caused data leakage in the testing phases and may have led to some sub-optimal parameters being chosen. Fitting the scaler and imputer on the train data only would fix this.
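
One way to avoid this leakage is to put the scaler, imputer, and model into a scikit-learn `Pipeline` and run the grid search on the pipeline. This is a sketch of that idea, not the code used in the project. The data is synthetic, and the small parameter grid is only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with some values knocked out to mimic missing entries.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Inside a Pipeline, the scaler and imputer are fit on the training folds only,
# then applied to the validation fold and, later, to the test split.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("impute", KNNImputer()),
    ("model", RandomForestClassifier(random_state=0)),
])
param_grid = {"model__n_estimators": [10, 50], "model__max_depth": [None, 10]}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train, y_train)
test_accuracy = grid_search.score(X_test, y_test)
print(grid_search.best_params_, f"{test_accuracy:.2%}")
```

Because the preprocessing lives inside the pipeline, the test split never influences the scaling ranges or the imputed values.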

Much information was lost when reducing the excess number of categories. Given more time or processing power, fewer categories should be removed, which could lead to more accurate results.

More grid searches could be run on the XGBoost model to improve its performance. It is more responsive to tuning than the other models, meaning it has more room for improvement.

If the true labels for the submission data are provided, the recall and precision of the models can be determined. This could reveal where the models are predicting wrongly, and therefore what could be improved.
