3. INTRODUCTION
With a growing population and an ever-increasing demand for resources, we
face a dilemma over how to manage our lives. In that struggle, we sometimes
end up using a poor or contaminated source of water and thus put our health
at stake.
According to a recent World Health Organization (WHO) survey, more than 2.2
billion people worldwide face problems due to unsafe drinking water, and 21% of
diseases are related to impure water.
4. The proposed system aims to address this problem by allowing users to
assess the quality of a given water sample and predicting whether the water
is contaminated or not.
In the current scenario, facilities are available for testing water samples by
bringing them to the water authorities. But the process is time-consuming, as it
usually takes several weeks for the reports to be received. This causes
dissatisfaction among users.
6. PAPER 1 :- Nouraki, A.; Alavi, M.; Golabi, M.; Albaji, M. Prediction of water quality parameters using
machine learning models: A case study of the Karun River, Iran. Environ. Sci. Pollut. Res. 2021, 28,
57060–57072.
The growing worldwide emphasis on water quality is giving rise to widespread
research and an expanding market for novel and astute monitoring systems.
The current method is a laboratory process in which samples are taken from water
bodies and tested in labs. This method is time-consuming, wasteful of manpower, and
uneconomical. An Artificial Neural Network (ANN) is therefore used to address this
problem. The method avoids chemical evaluation of water quality parameters and is
cost-effective. The paper gives a brief methodology for predicting unknown parameters
such as alkalinity, chloride, and sulphate from known parameters such as pH, electrical
conductivity, and TDS, using the Levenberg-Marquardt algorithm, which helps in the
further classification of water bodies for different applications. The reported
accuracies were 83.94%, 87.9%, 81.736%, and 79.48% in predicting chloride, total
hardness, sulphate, and total alkalinity, respectively.
7. PAPER 2 :- A. Abraham, D. Livingston, I. Guerra and J. Yang, "Exploring the Application of Machine Learning
Algorithms to Water Quality Analysis," 2022 IEEE/ACIS 7th International Conference on Big Data, Cloud
Computing, and Data Science (BCD), 2022, pp. 142-148, doi: 10.1109/BCD54882.2022.9900636.
In this experimental study, we use different Machine Learning algorithms to
decide the quality of the water within the San Antonio River and its tributaries
using datasets of previously collected water data from the San Antonio River
Authority along with the Kaggle water potability dataset. For each algorithm to
work, we used a set of parameters by which to measure the quality of the river
water. Out of all the algorithms utilized, we found that random forest and K-
Nearest Neighbor (KNN) were the best at achieving accurate results, with accuracy
ratings of 0.6520 and 0.6469, respectively. Using K-means, we were able to find
four distinct clusters in the San Antonio River data. The separation of these
clusters was low for the parameters used resulting in a silhouette score of 0.229.
From this data, we may be able to determine which sections of the main San
Antonio River, as well as which tributaries, are acceptable (or healthy enough) for
primary contact recreation use.
8. PAPER 3 :- H. Mohammed, I. A. Hameed and R. Seidu, "Random forest tree for predicting fecal indicator organisms in
drinking water supply," 2017 International Conference on Behavioral, Economic, Socio-cultural Computing (BESC),
2017, pp. 1-6, doi: 10.1109/BESC.2017.8256398.
A variety of modeling techniques have been widely applied for predicting levels of fecal indicator
organisms in raw water. However, deficiencies in the performances of some methods make it
difficult for implementation in full-scale water supply systems. This study examines the
efficiency of random forest (RF) which is made up of a number of decision trees in the prediction
of fecal indicator organisms in raw water based on records of conductivity, pH, color, turbidity
taken from a drinking water source in Bergen, Norway, as well as seasons. Results of the study
indicate that the method is capable of estimating important variations in levels of the
microorganisms in the raw water with acceptable accuracy. Color of water and the effect of
autumn season were the most important in explaining the variations in the levels of the coliform
bacteria, intestinal enterococci and E. coli in raw water in both the full and the reduced models.
Considerable reduction in the model out-of-bag sample error was achieved in the reduced
models, where only two most important variables were used as predictors. With further research
aimed at improving the estimation error, the random forest method can be a reliable tool for real
time prediction of potential levels of microorganisms in raw water.
9. Findings and Proposals
Paper 1 gives a brief methodology for predicting unknown parameters such as
alkalinity, chloride, and sulphate from known parameters such as pH,
electrical conductivity, and TDS, using the Levenberg-Marquardt algorithm,
which helps in the further classification of water bodies for different
applications. The reported accuracies were 83.94%, 87.9%, 81.736%, and
79.48% in predicting chloride, total hardness, sulphate, and total
alkalinity, respectively.
Paper 2 concludes that the Random Forest method gives the best accuracy
for water quality prediction among the machine learning methods it
compared. Of all the algorithms used, random forest and K-Nearest
Neighbor (KNN) achieved the most accurate results, with accuracy ratings
of 0.6520 and 0.6469, respectively.
10. In Paper 3, the results indicate that the method is capable of
estimating important variations in the levels of microorganisms in raw
water with acceptable accuracy. With further research aimed at
improving the estimation error, the Random Forest method can be a
reliable tool for real-time prediction of potential levels of
microorganisms in raw water.
From the analysis of the above three papers, we can understand that
machine learning, specifically the Random Forest method, gives more
accurate results for the prediction of water quality. The most
important features to be considered include pH, hardness,
conductivity, and turbidity.
11. DATASET
The proposed system is implemented using the water potability dataset
from Kaggle. The water_potability.csv file contains water quality
metrics for 3276 water samples, with 9 features and one class variable.
The features are pH, Hardness, Solids, Chloramines, Sulfate,
Conductivity, Organic Carbon, Trihalomethanes, and Turbidity.
Potability is the class label.
URL: https://www.kaggle.com/datasets/adityakadiwal/water-potability
12. The first five rows of the dataset are shown in the figure. The dataset contains null values,
which are later handled by preprocessing techniques.
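Checking for null values can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the sample rows are made up, and the column names are assumed rather than taken from the text above.

```python
import csv
import io

# Made-up rows in the shape of water_potability.csv; blank cells are missing values.
# The column names here are assumptions for illustration.
SAMPLE_CSV = """ph,Hardness,Sulfate,Potability
,204.9,368.5,0
3.7,129.4,,0
8.1,224.2,,0
8.3,214.4,356.9,1
"""

def count_missing(csv_text):
    """Return {column name: number of empty cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames
    counts = {name: 0 for name in names}
    for row in reader:
        for name in names:
            cell = row.get(name)
            if cell is None or cell.strip() == "":
                counts[name] += 1
    return counts

missing = count_missing(SAMPLE_CSV)
print(missing)
```

On the real file, the same function could be applied to the file's text to see which features need cleaning.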
13. Data Preprocessing
Data Cleaning
As null values are present in the dataset, data cleaning is carried out and the
missing values are filled. Null values in each feature are filled using the mean
of the respective feature.
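Mean imputation as described above can be sketched in a few lines. This is an illustrative, standard-library version with made-up values; the function name `fill_with_column_mean` is hypothetical and is not the project's code.

```python
from statistics import mean

def fill_with_column_mean(rows):
    """Return a copy of `rows` where each None is replaced by its column's mean.

    `rows` is a list of equal-length lists of floats or None. The mean of a
    column is computed over its non-missing values only.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(present))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Made-up values for two features (pH, Sulfate); None marks a missing entry.
data = [
    [7.0, 300.0],
    [None, 340.0],
    [9.0, None],
]
filled = fill_with_column_mean(data)
print(filled)  # [[7.0, 300.0], [8.0, 340.0], [9.0, 320.0]]
```

The sketch assumes every column has at least one non-missing value; a column that is entirely empty would need separate handling.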
15. Analysis of feature variables
1. pH value: pH is an important parameter in evaluating the acid–base balance
of water.
2. Hardness: Hardness is mainly caused by calcium and magnesium salts. These
salts are dissolved from geologic deposits through which water travels.
3. Solids (Total dissolved solids - TDS): Water can dissolve a wide range of
inorganic and some organic minerals or salts such as potassium, calcium,
sodium, bicarbonates, chlorides, magnesium, and sulfates. These minerals
produce an unwanted taste and a diluted color in the appearance of water.
4. Chloramines: Chlorine and chloramine are the major disinfectants used in
public water systems. Chloramines are most commonly formed when
ammonia is added to chlorine to treat drinking water.
16. 5. Sulfate: Sulfates are naturally occurring substances that are found in minerals,
soil, and rocks. They are present in ambient air, groundwater, plants, and food.
6. Conductivity: Pure water is not a good conductor of electric current; rather, it is a
good insulator. An increase in ion concentration enhances the electrical
conductivity of water.
7. Total Organic Carbon: Total organic carbon (TOC) in source waters comes from
decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure
of the total amount of carbon in organic compounds in pure water.
8. Trihalomethanes: THMs are chemicals which may be found in water treated
with chlorine.
9. Turbidity: The turbidity of water depends on the quantity of solid matter
present in the suspended state. It is a measure of the light-scattering properties
of water.
17. Analysis of class variable
The figure displays the counts of the class variable, Potability. The
dataset contains 1998 samples labelled 0, which implies contaminated water,
and 1278 samples labelled 1, which implies potable water.
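The class counts reported above can be turned into proportions with a short calculation. The sketch below also derives the accuracy of a naive rule that always predicts the majority class. That baseline is not part of the original analysis and is included only as an assumed reference point for reading accuracy figures on an imbalanced dataset.

```python
counts = {0: 1998, 1: 1278}  # class counts reported above

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]  # accuracy of always predicting the majority class

print(total)  # 3276
print({label: round(share, 3) for label, share in shares.items()})
print(majority_label, round(baseline_accuracy, 3))
```

Roughly 61% of the samples are labelled 0 and 39% are labelled 1, so the two classes are moderately imbalanced.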
19. ANALYSIS OF ALGORITHM
Random Forest
Random forests, or random decision forests, are an ensemble learning method for
classification, regression and other tasks that operates by constructing a
multitude of decision trees at training time.
For classification tasks, the output of the random forest is the class selected by
most trees. For regression tasks, the mean or average prediction of the
individual trees is returned.
Instead of relying on one decision tree, the random forest takes the prediction
from each tree and predicts the final output based on the majority vote of
those predictions.
The random forest algorithm reduces overfitting to a great extent.
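How the individual tree outputs are combined can be illustrated with two small helper functions. The names `forest_classify` and `forest_regress` are hypothetical, and the tree outputs are made-up values rather than results from a trained model.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the class label predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the individual trees' numeric predictions."""
    return mean(tree_outputs)

votes = [0, 1, 0, 0, 1]  # made-up class predictions from five trees
print(forest_classify(votes))           # 0
print(forest_regress([2.0, 4.0, 3.0]))  # 3.0
```

In this sketch a tie is resolved in favour of the label that appears first in the list; using an odd number of trees avoids ties for two-class problems.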
21. The training algorithm for random forests applies the general technique of
bootstrap aggregating (bagging) to tree learners.
Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging
repeatedly (B times) selects a random sample with replacement of the
training set and fits trees to these samples. For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a classification or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging
the predictions from all the individual regression trees on x':
$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$
or by taking the majority vote in the case of classification trees.
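The bagging procedure can be sketched end to end on a toy problem. To keep the example short, each "tree" is a one-feature threshold rule (a decision stump) standing in for a full decision tree, and the data are made up. A real random forest also considers a random subset of features at each split, which this sketch omits. All function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature threshold rule that minimises training errors.

    The rule predicts `high_label` when x > t and the other label otherwise.
    It is a stand-in for a full decision tree and returns a prediction function.
    """
    best = None
    for t in sorted(set(xs)):
        for high_label in (0, 1):
            preds = [high_label if x > t else 1 - high_label for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, high_label)
    _, t, high_label = best
    return lambda x: high_label if x > t else 1 - high_label

def bagged_stumps(xs, ys, n_trees, seed=0):
    """Train `n_trees` stumps, each on a bootstrap sample of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n indices drawn with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Majority vote over the trees' predictions for one input."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Made-up, cleanly separable data: small values are class 0, large values class 1.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
trees = bagged_stumps(xs, ys, n_trees=25)
print(predict(trees, 0.5), predict(trees, 9.5))
```

Because each stump sees a different bootstrap sample, individual thresholds vary, while the majority vote stays stable.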
22. Working of algorithm
The following steps explain the working of the Random Forest algorithm:
Step 1: Select random samples from the given data or training set.
Step 2: Construct a decision tree for every random sample.
Step 3: Collect the prediction of every decision tree and hold a vote
(or take the average, for regression).
Step 4: Finally, select the most voted prediction result as the final
prediction result.
This combination of multiple models is called an ensemble.
24. Data collection: A dataset with appropriate parameters like pH,
Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic
Carbon, Trihalomethanes and class variable Potability is used.
Data Pre-processing: Put the acquired dataset into an organized
format. Data cleaning is the pre-processing method we chose.
Missing values are filled in this phase.
Split Data: In this phase we split the preprocessed data into
training and test data. 80% of the data is taken for training and the
remaining 20% is taken for testing.
Load Train Data: The training data is loaded for training the model
using the Random Forest algorithm.
25. Train Model: The loaded data is provided for training and a model is
created using the Random Forest algorithm and it is saved for further
use.
Confusion Matrix: A confusion matrix is plotted from the model's
predictions to determine the True Positive, True Negative, False
Positive, and False Negative counts.
Export trained model: The trained model is then exported for
testing purposes.
Load trained model: The exported model is loaded for testing.
Load test data: Finally, test data (input) is provided to predict whether
the water sample is contaminated or not by analyzing the provided
parameters.
The result is shown on the user interface where the input
parameters were provided.
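The export, load, and confusion-matrix steps above can be sketched with the standard library. The "model" here is a hypothetical stand-in (a simple pH range rule stored as a dictionary), not the trained Random Forest, and the test readings and labels are made up. The sketch shows only the mechanics of saving a model, restoring it, and counting the four outcome types.

```python
import pickle

def predict(model, ph_values):
    """Hypothetical stand-in for a trained classifier: 1 when pH lies in [low, high]."""
    return [1 if model["low"] <= ph <= model["high"] else 0 for ph in ph_values]

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

model = {"low": 6.5, "high": 8.5}  # illustrative parameters, not learned from data

# Export the model to bytes and load it back, as a file-based export would.
restored = pickle.loads(pickle.dumps(model))

# Made-up test data: pH readings and their true potability labels.
test_ph = [7.0, 5.0, 8.0, 9.5, 7.5, 6.0]
test_labels = [1, 0, 0, 0, 1, 1]

cm = confusion_counts(test_labels, predict(restored, test_ph))
accuracy = (cm["TP"] + cm["TN"]) / len(test_labels)
print(cm, round(accuracy, 3))  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1} 0.667
```

Accuracy is the share of correct predictions, that is, true positives plus true negatives divided by the number of test samples.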
26. SYSTEM DESIGN
Model Planning
The dataset is split so that one portion is used for training the model and
the other for testing it. 80% of the dataset is used as training data and
the remaining 20% as testing data.
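An 80/20 split of the 3276 rows can be sketched as below. The function name, the seed value, and the choice to round the test size up are assumptions made for illustration. The project's actual splitting code is not shown in the source.

```python
import math
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # round the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(3276)
print(len(train_idx), len(test_idx))  # 2620 656
```

Shuffling before splitting keeps both portions representative of the whole dataset, and fixing the seed makes the split reproducible between runs.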
28. Model Testing
The trained model achieves an accuracy of 71.5%. A set of values from the
dataset is then selected and used for prediction. As the selected values
correspond to contaminated water, 0 is displayed as the output.
29. RESULTS AND DISCUSSION
An accuracy score of 71.5% was obtained by repeatedly training
the model while changing hyperparameters such as criterion,
n_estimators, and random state.
The criterion was changed from entropy to gini, as the accuracy
score for entropy was much lower than for gini.
n_estimators is the number of trees, and it was increased to
get the final accuracy.
The test-to-train splitting ratio was changed from 3:7 to 2:8.
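The two criterion values mentioned above, gini and entropy, are different measures of how mixed the class labels in a tree node are. A minimal sketch of both formulas follows, applied to the dataset's overall class counts. It is intended only to show what the two measures compute and does not reproduce the accuracy comparison.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(gini([1998, 1278]), entropy([1998, 1278]))  # whole dataset: fairly mixed
print(gini([10, 0]), entropy([10, 0]))            # pure node: both are zero
print(gini([5, 5]), entropy([5, 5]))              # evenly mixed node: 0.5 and 1.0
```

Both measures are zero for a pure node and largest for an evenly mixed one, so they often lead to similar trees; which one scores better on a given dataset is an empirical question.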
31. CONCLUSION
The project is meant to be a replacement for the existing manual
system of water testing, as the existing system is very
time-consuming and involves human labour.
The system automates the process of testing water samples
using various parameters of the water.
The proposed system takes various parameters of water as inputs
and predicts whether the water sample with the provided
parameters is contaminated or not.
The system is based on three research papers, which suggest the
Random Forest algorithm as the best algorithm for predicting
the quality of water samples, and hence the Random Forest
algorithm is used for the implementation.