Analyzing Water Well Data in Tanzania

A project with Jude Buenaseda

This project classifies the functionality of water wells in Tanzania, a country in East Africa known for its safaris and Serengeti National Park. Tanzania has one of the fastest growing economies in Africa, but many communities are still left behind, especially in rural areas. One of the biggest issues in these communities is access to clean water. Only about 50% of the population has access to safe water. The other half collect water from wells that are sometimes far away and sometimes nonfunctional. With the help of the communities themselves, who report data about their wells, we are attempting to help.

The Basics

The dataset had just under 60,000 data points with information about the wells in Tanzania, such as location, altitude, and water quality and quantity. The target variable, the functionality of the well, is separated into 3 classes: functional wells, non functional wells, and functional wells that need repair. The functional class made up the largest share of the data, with non functional wells close behind and wells that need repair by far the least common class. We are looking for key features in the data that reveal patterns, especially for wells that are non functional. Finding them helps locate issues, which will hopefully lead to solutions that provide clean water all across Tanzania. The features we found to be the most impactful are how the water is extracted from the well and the region the well is located in, but before we get into more on that, let's first look at the process.
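As a rough illustration of how that class balance can be checked, here is a minimal sketch using only the Python standard library. The function name `class_proportions` and the toy labels are made up for the example; they are not the real dataset or our actual code.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the labels, most common class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical toy labels, NOT the real well data.
toy_status = (
    ["functional"] * 5
    + ["non functional"] * 4
    + ["functional needs repair"] * 1
)

proportions = class_proportions(toy_status)
for label, share in proportions.items():
    print(f"{label}: {share:.0%}")
```

A check like this is worth doing early because a rare class, like the wells that need repair, is usually the hardest one for a model to predict well.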


The first thing we wanted to look at was population, and more specifically population by construction year for the 3 different classes of wells. What we found is that, in general, more recently constructed wells in higher population areas had a high rate of functional wells, while older wells in lower population areas skewed more towards non functional. This makes sense: the earlier a well was built, the more likely it is to need repair just from wear over time.

After seeing the differences in population and construction year, we wanted to find other features that differed clearly between the statuses of wells.

The first of these is the extraction type for the well, seen above. We see here that most wells extract water by gravity or hand pumps, and those that don't have a high rate of non functionality.
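The comparison behind a chart like this can be sketched as a simple group-by: count the wells in each extraction type and the share of them that are non functional. The sketch below is only an illustration. The keys `extraction_type` and `status`, the helper name, and the toy rows are all assumptions made for the example, not values from the real data.

```python
from collections import defaultdict


def non_functional_rate_by_group(rows, group_key, status_key="status"):
    """Map each group to (number of wells, share that are non functional)."""
    totals = defaultdict(int)
    broken = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[status_key] == "non functional":
            broken[group] += 1
    return {group: (totals[group], broken[group] / totals[group]) for group in totals}


# Hypothetical toy rows, NOT the real well data.
toy_rows = (
    [{"extraction_type": "gravity", "status": "functional"}] * 3
    + [{"extraction_type": "gravity", "status": "non functional"}] * 1
    + [{"extraction_type": "handpump", "status": "functional"}] * 2
    + [{"extraction_type": "handpump", "status": "non functional"}] * 1
    + [{"extraction_type": "motorpump", "status": "non functional"}] * 2
)

rates = non_functional_rate_by_group(toy_rows, "extraction_type")
for group, (n_wells, rate) in rates.items():
    print(f"{group}: {n_wells} wells, {rate:.0%} non functional")
```

Reporting the count next to the rate matters here, since a high failure rate in a rarely used extraction type rests on far fewer wells than the same rate in a common one.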

For the region, we wanted to find specific places that broke from the overall pattern, in which functional wells are the dominant class. We can see here that places like Lindi and Mtwara have a high rate of non functional wells, while Kigoma has a large number of wells that need repair. We can visualize this by looking at a map of Tanzania below with points plotted for each type of well.
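One way to flag regions like these is to find the most common status in each region and keep only the regions where it is not "functional". This is a minimal sketch of that idea, not our exact analysis. The function names are made up, and the toy rows use placeholder region names so they are not mistaken for real results.

```python
from collections import Counter, defaultdict


def dominant_status_by_region(rows):
    """Map each region to its most common well status."""
    by_region = defaultdict(Counter)
    for region, status in rows:
        by_region[region][status] += 1
    return {region: counts.most_common(1)[0][0] for region, counts in by_region.items()}


def unusual_regions(rows, expected="functional"):
    """Keep only the regions whose dominant status differs from the expected one."""
    dominant = dominant_status_by_region(rows)
    return {region: status for region, status in dominant.items() if status != expected}


# Hypothetical toy rows with placeholder region names, NOT the real well data.
toy_rows = (
    [("region_a", "functional")] * 3
    + [("region_a", "non functional")] * 1
    + [("region_b", "functional")] * 1
    + [("region_b", "non functional")] * 3
    + [("region_c", "functional")] * 1
    + [("region_c", "functional needs repair")] * 2
)

flagged = unusual_regions(toy_rows)
print(flagged)
```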

What we wanted to know, however, is why. The data itself did not show any striking differences between these areas and others. Lindi and Mtwara did have a slightly higher rate of dry wells than other areas, but the gap was not large enough to explain the high proportion of non functional wells. Digging deeper, we found that Mtwara and Lindi are hard to access, so we think it is difficult to send supplies when they are needed to maintain the wells. In Kigoma, we found that soil erosion is a big problem: it pollutes the water and gets into the wells' pipelines, which breaks the wells down and causes further damage. So it was important to learn that these areas have deeper regional issues that are not captured in the data.

The last thing we wanted to look at was altitude and how it affects well status. The idea was that higher altitude areas may be harder to reach and maintain, and since Tanzania has some very high altitude points, this could be a problem. But what we found below is that this is not the case. Two of the higher altitude areas on the map have a higher proportion of functional wells, as is common for the data as a whole, and the lower altitude areas are spread out across the rest of the map. So altitude may not have a large effect on well status.


In the end we tried many different types of models to get our final results. We started with a base model using KNN, as we thought the points being on a map would cluster in a way that made KNN useful. This gave us a decent score, but one issue we ran into was that KNN could not handle the missing values unless we imputed or removed them. Removing them was not an option, because the holdout data set also had missing values. So we decided to do 2 things. First, we used a tuned XGBoost model, which handled the missing values for us. This ended up being our best model, but we did not know that at the time, so we kept trying other things. Next, we used a logistic regression model to impute the missing values, which allowed us to use all different types of models. So that's what we did, trying random forest, KNN, and others. The best model on the imputed data was again a tuned XGBoost model, but since its score was worse than the original XGBoost model's, we stuck with the first one.
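To make the missing-value problem concrete, here is a small dependency-free sketch of why KNN needs imputation first. It is not our actual pipeline: it swaps in a simple median fill where we used a logistic regression model, and the helper names, the two generic features, and the toy data are all made up for the example.

```python
import math
from collections import Counter
from statistics import median


def fit_medians(rows):
    """Median of each feature column, ignoring missing (None) entries."""
    return [median(v for v in column if v is not None) for column in zip(*rows)]


def impute(row, medians):
    """Replace each missing value with the median learned for its column."""
    return [m if v is None else v for v, m in zip(row, medians)]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data with gaps (None), NOT the real well data.
train_X_raw = [[1.0, 1.0], [1.2, None], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [None, 5.1]]
train_y = ["functional"] * 3 + ["non functional"] * 3
holdout_row = [5.1, None]  # the holdout data has gaps too, so its rows cannot be dropped

# Distances are undefined when a value is missing, so plain KNN fails here.
try:
    knn_predict(train_X_raw, train_y, [5.1, 4.9])
    raw_knn_failed = False
except TypeError:
    raw_knn_failed = True

# Learn the fill values on the training rows only, then reuse them on the holdout row.
medians = fit_medians(train_X_raw)
train_X = [impute(row, medians) for row in train_X_raw]
prediction = knn_predict(train_X, train_y, impute(holdout_row, medians))
print(raw_knn_failed, prediction)
```

The fill values are learned from the training rows and then reused on the holdout row, so the holdout data never influences how the gaps are filled. A model like XGBoost skips this step because it handles missing values internally.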


All of that leads to the importance of knowing which wells function, need repair, or don't work at all. This can have a huge impact on Tanzania's communities. Using Taarifa's data and the models we created, we can improve maintenance operations and reveal problems that may be overlooked, such as underlying causes like soil erosion or the lack of attention to remote areas with few access points.