Project Report on
PREDICTION OF BEST LOCATION FOR SOLAR
FARM IN ORDER TO MEET ENERGY DEMAND
AND COMPANY PROFIT.
Submitted by
SHRUTEJ JARIWALA
PARSHWA BHAVSAR
VIRAL SUREJA
VISHNUVARDHAN CHOWDARY
SUBJECT – AI BASICS
PROF: JEAN-MICHEL TAVERNE
1. Problem Summary
Find a location for a solar farm that can fulfil the customer's energy demand:
1) Scenario 1: 2 million kWh/a
2) Scenario 2: 3 million kWh/a
Limitations:
The building space in the regions is limited, so plants may be built on at most the following area per region:
- North-West: 3,000 m²
- North-East: 3,000 m²
- South-West: 2,000 m²
- South-East: 2,000 m²
For one square metre of solar plant, Smart Energy LLC has to pay 100 € for the material plus the cost of the land. For scenario 1, a budget of 2 million € can be invested; for scenario 2, 3 million €.
Objective
We have to find the best solar farm location that can fulfil the consumer's energy need while also ensuring company profit. For that, we want to apply machine learning techniques to solve this problem.
1.1. What is Machine Learning?
Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by
1) gathering a dataset,
2) algorithmically building a statistical model based on that dataset.
That statistical model is assumed to be used somehow to solve the practical problem.
To save keystrokes, I use the terms “learning” and “machine learning” interchangeably.
(Burkov, 2020)
Types of learning can be supervised, semi-supervised, unsupervised and reinforcement.
1.2. Supervised Learning
In supervised learning, the dataset is the collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$. For instance, if each example $x$ in our collection represents a person, then the first feature, $x^{(1)}$, could contain height in cm, the second feature, $x^{(2)}$, could contain weight in kg, $x^{(3)}$ could contain gender, and so on. For all examples in the dataset, the feature at position $j$ in the feature vector always contains the same kind of information. It means that if $x_i^{(2)}$ contains weight in kg in some example $x_i$, then $x_k^{(2)}$ will also contain weight in kg in every example $x_k$, $k = 1, \ldots, N$. The label $y_i$ can be either an element belonging to a finite set of classes $\{1, 2, \ldots, C\}$, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, $y_i$ is either one of a finite set of classes or a real number. You can see a class as a category to which an example belongs.
For instance, if your examples are email messages and your problem is spam
detection, then you have two classes {spam, not spam}. The goal of a supervised
learning algorithm is to use the dataset to produce a model that takes a feature vector x
as input and outputs information that allows deducing the label for this feature vector.
For instance, the model created using the dataset of people could take as input a
feature vector describing a person and output a probability that the person
has cancer.
1.3. Unsupervised Learning
In unsupervised learning, the dataset is a collection of unlabelled examples $\{x_i\}_{i=1}^{N}$.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to
create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For
example, in clustering, the model returns the id of the cluster for each feature vector
in the dataset. In dimensionality reduction, the output of the model is a feature vector
that has fewer features than the input x; in outlier detection, the output is a real
number that indicates how x is different from a “typical” example in the dataset.
1.4 Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine “lives”
in an environment and is capable of perceiving the state of that environment as a
vector of features. The machine can execute actions in every state. Different actions
bring different rewards and could also move the machine to another state of the
environment.
2. Datasets:
Installed Solar plants
New locations data sets.
We have two datasets. The first one, “Installed Solar plants”, contains data on 20 already installed power plants: each plant's sunshine hours per year, solar panel area in m², and generated energy in kWh/a.
The second dataset contains 56 months of data, from January 2018 to August 2022, for 59 unique new locations; it gives sunshine hours, price per m², and average wind speed for each.
3. Requirements:
Python
IDE: Jupyter Notebook
Libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn
3.1. Libraries:
import pandas as pd - pandas is a popular Python-based data analysis toolkit which
can be imported using import pandas as pd. It presents a diverse range of utilities,
ranging from parsing multiple file formats to converting an entire data table into a
NumPy array. This makes pandas a trusted ally in data science and machine
learning. Similar to NumPy, pandas deals primarily with data in 1-D and 2-D arrays;
however, pandas handles the two differently.
import matplotlib.pyplot as plt - matplotlib.pyplot is stateful, in that it keeps track
of the current figure and plotting area, and the plotting functions are directed to the
current axes; it can be imported using import matplotlib.pyplot as plt.
import seaborn as sns - Seaborn is a library for making statistical graphics in Python.
It builds on top of matplotlib and integrates closely with pandas’ data structures.
Seaborn helps you explore and understand your data. Its plotting functions operate on
data frames and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its
dataset-oriented, declarative API lets you focus on what the different elements of
your plots mean, rather than on the details of how to draw them.
import numpy as np - NumPy provides a large set of numeric datatypes that you can
use to construct arrays. NumPy tries to guess a datatype when you create an array, but
functions that construct arrays usually also include an optional argument to explicitly
specify the datatype.
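As a small illustration of how these libraries fit together (a sketch written for this report, not code from the project notebook; the column names and values are taken from the installed-plants dataset shown later):

import numpy as np
import pandas as pd

# pandas treats a 1-D Series and a 2-D DataFrame differently
hours = pd.Series([1418, 1474, 1335], name='Sunshine Hours per year')
df_demo = pd.DataFrame({'Sunshine Hours per year': hours,
                        'Size Solar Panel m2': [794, 1726, 5776]})
print(df_demo.to_numpy())  # the whole table converted to a NumPy array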
4. Initial thoughts and Observation
We can observe that both datasets have one column in common: sunshine hours.
For every region we have a limited area in m²; if we can find out how much energy is generated per m² in each region, we can determine how much energy each region can generate.
We also want to find out if there is any correlation between sunshine hours and energy per m².
5. Solution process:
6.1 Importing data:
We import the data using the pandas library.
We divided the generated energy by the solar panel area in m²; this gives the energy generated per square metre.
6.2 Data exploration: installed plant data
Using the seaborn library, we plotted a pair plot of the data:
df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']
sns.pairplot(df)
We try to find out how strongly the sunshine hours per year column is correlated with
the energy per m² column.
Observation:
1) Energy per m² is directly proportional to Sunshine Hours per year.
2) Solar panel m² is directly proportional to Sunshine Hours per year.
#Let's see how much it is correlating.
#we find the correlation and plot it with a heatmap.
sns.heatmap(df.corr(), annot = True)
Observation:
1) Energy per m² is almost completely determined by sunshine hours per year.
2) We can predict energy per m² from sunshine hours per year, so we can apply a regression model.
6.3 Classification vs. Regression
Classification is a problem of automatically assigning a label to an unlabelled example.
Spam detection is a famous example of classification. In machine learning, the
classification problem is solved by a classification learning algorithm that takes a
collection of labelled examples as inputs and produces a model that can take an
unlabelled example as input and either directly output a label or output a number that
can be used by the analyst to deduce the label. An example of such a number is a
probability.
In a classification problem, a label is a member of a finite set of classes. If the size of
the set of classes is two (“sick”/ “healthy”, “spam”/“not spam”), we talk about binary
classification (also called binomial in some sources). Multiclass classification (also called multinomial) is a classification problem with three or more classes. While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies that allow turning a binary classification learning algorithm into a multiclass one.
Regression is a problem of predicting a real-valued label (often called a target) given
an unlabelled example. Estimating house price valuation based on house features, such
as area, the number of bedrooms, location and so on is a famous example of regression.
The regression problem is solved by a regression learning algorithm that takes a
collection of labelled examples as inputs and produces a model that can take an
unlabelled example as input and output a target. (Burkov, 2020)
6.4 Linear Regression
Linear regression is a popular regression learning algorithm that learns a model which
is a linear combination of features of the input example.
6.4.1 Problem Statement
We have a collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the collection, $x_i$ is the $D$-dimensional feature vector of example $i = 1, \ldots, N$, $y_i$ is a real-valued target and every feature $x_i^{(j)}$, $j = 1, \ldots, D$, is also a real number. We want to build a model $f_{w,b}(x)$ as a linear combination of features of example $x$:

$$f_{w,b}(x) = wx + b,$$

where $w$ is a $D$-dimensional vector of parameters and $b$ is a real number. The notation $f_{w,b}$ means that the model $f$ is parametrized by two values: $w$ and $b$. We will use the model to predict the unknown $y$ for a given $x$ like this: $y \leftarrow f_{w,b}(x)$. Two models parametrized by two different pairs $(w, b)$ will likely produce two different predictions when applied to the same example. We want to find the optimal values $(w^*, b^*)$.
Obviously, the optimal values of parameters define the model that makes the most
accurate predictions. You could have noticed that the form of our linear model in eq. 1
is very similar to the form of the SVM model. The only difference is the missing sign
operator. The two models are indeed similar. However, the hyperplane in the SVM plays
the role of the decision boundary: it’s used to separate two groups of examples from one
another. As such, it has to be as far from each group as possible. On the other hand, the
hyperplane in linear regression is chosen to be as close to all training examples as
possible. You can see why this latter requirement is essential by looking at the
illustration in Figure 1. It displays the regression line (in red) for one-dimensional
examples (blue dots). We can use this line to predict the value of the target $y_{new}$ for a new unlabelled input example $x_{new}$. If our examples are $D$-dimensional feature vectors (for $D > 1$), the only difference with the one-dimensional case is that the regression model is not a line but a plane or a hyperplane (for $D > 2$).
Now you see why it's essential to have the requirement that the regression hyperplane lies as close to the training examples as possible: if the red line in Figure 1 was far from the blue dots, the prediction $y_{new}$ would have fewer chances to be correct.
6.4.2 Solution
To satisfy this latter requirement, the optimization procedure which we use to find the optimal values for $w^*$ and $b^*$ tries to minimize the following expression:
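$$\min_{w,b} \; \frac{1}{N} \sum_{i=1}^{N} \left( f_{w,b}(x_i) - y_i \right)^2$$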
In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression $(f_{w,b}(x_i) - y_i)^2$ in the above objective is called the loss function.
It's a measure of penalty for misclassification of example $i$. This particular choice of the loss function is called squared error loss. All model-based learning algorithms have a loss function, and what we do to find the best model is try to minimize the objective, known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, for a model is the average of all penalties obtained by applying the model to the training data.
Why is the loss in linear regression a quadratic function? Why couldn't we use the absolute value of the difference between the true target $y_i$ and the predicted value $f(x_i)$ as a penalty? We could. Moreover, we could also use a cube instead of a square.
We decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute difference between $f(x_i)$ and $y_i$ makes sense, the cube of the difference too, and the binary loss (1 when $f(x_i)$ and $y_i$ are different and 0 when they are the same) also makes sense, right? Sounds easy, doesn't it? However, do not rush to invent a new learning algorithm. If we made different decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a different machine learning algorithm. (Burkov, 2020)
Implementing Linear Regression:
We took the “Sunshine Hours” column as the feature and energy per m² as the label. Then we split the data in a 60/40 ratio to create train and test sets, and imported the LinearRegression model from the scikit-learn library. We trained the model on 60% of the data and tested it on the remaining 40%. With the predict function we predict the energy for the test data, and then compare the predicted values to the test labels; that is how we estimate the error of our model.
The algorithm also gives us the regression coefficient and intercept.
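A condensed sketch of these steps (the full notebook code is reproduced at the end of this report; the file path is shortened here):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

df = pd.read_excel('Installed Solar Plants.xls')            # installed-plants dataset
df['energy_per_m2'] = df['Generated energy kWh/a'] / df['Size Solar Panel m2']
X = df[['Sunshine Hours per year']]                         # feature
y = df['energy_per_m2']                                     # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # 60/40 split
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)
print(lm.coef_, lm.intercept_)                              # slope and intercept
print(metrics.mean_absolute_error(y_test, predictions))     # error on the test set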
Observation:
From the scatter plot we can observe a straight line, which shows little deviation and high accuracy. Then we check the mean absolute error and the R² score.
6.5. Analysis of the second dataset: the location dataset.
Observation:
In the first row we can see the scatter plot of longitude vs latitude, which gives the locations of the properties. We can also observe four clusters of regions.
When we group the data by date and count the values, we find there are 59 unique locations, and for each location 56 months of data are given.
The data is given monthly, so we have to convert it to a yearly format.
Steps:
1) First, we will create datasets for the 59 locations.
2) We classify each location into a region.
3) We find the average sunshine hours, average price and average wind speed of each location on a yearly basis.
4) Then we predict the energy for each region.
Finding average Sunshine Hours:
For each location, we add the sunshine hours of all 56 months and divide by 56, which gives the average sunshine hours per month. Then we multiply by 12 to get the average value for one year, as sketched below.
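A sketch of this aggregation, assuming the location dataset is loaded as df_1 with the column spelling used in the notebook ('Sunhine Hours'):

# average monthly sunshine hours per location, scaled to a yearly value
monthly_avg = df_1.groupby(['Longitude', 'Latitude'])['Sunhine Hours'].mean()
yearly_sunshine = monthly_avg * 12   # average hours per month x 12 months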
Predicting Regions:
We have only two features ['Longitude', 'Latitude'] and no labels; that is why we choose unsupervised learning to group the locations.
We prefer the k-means clustering algorithm.
9.2 Clustering
Clustering is a problem of learning to assign a label to examples by leveraging an
unlabelled dataset. Because the dataset is completely unlabelled, deciding on
whether the learned model is optimal is much more complicated than in supervised
learning.
There is a variety of clustering algorithms, and, unfortunately, it’s hard to tell which one is
better in quality for your dataset. Usually, the performance of each algorithm depends on
the unknown properties of the probability distribution the dataset was drawn from. In this
Chapter, I outline the most useful and widely used clustering algorithms. (Burkov, 2020)
9.2.1 K-Means
The k-means clustering algorithm works as follows. First, you choose $k$, the number of clusters. Then you randomly place $k$ feature vectors, called centroids, into the feature space. We then compute the distance from each example $x$ to each centroid $c$ using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (as if we labelled each example with a centroid id). For each centroid, we then calculate the average feature vector of the examples labelled with it. These average feature vectors become the new locations of the centroids.
We recompute the distance from each example to each centroid, modify the assignments and repeat the procedure until the assignments stop changing after the centroid locations are recomputed. The model is the list of assignments of centroid IDs to the examples.
The initial positions of the centroids influence the final positions, so two runs of k-means can result in two different models. Some variants of k-means compute the initial positions of centroids based on some properties of the dataset.
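The loop below is a minimal from-scratch sketch of this procedure (the report itself uses scikit-learn's KMeans; this sketch assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assign each example to its closest centroid (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # move each centroid to the average feature vector of its examples
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments stable: stop
            break
        centroids = new_centroids
    return labels, centroids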
One run of the k-means algorithm is illustrated in Figure 2. The circles in Figure 2 are
two-dimensional feature vectors; the squares are moving centroids. Different background
colours represent regions in which all points belong to the same cluster.
The value of k, the number of clusters, is a hyperparameter that has to be tuned by the
data analyst. There are some techniques for selecting k. None of them is proven optimal.
Most of those techniques require the analyst to make an “educated guess” by looking at some
metrics or by examining cluster assignments visually.
9.2.3 Determining the Number of Clusters
The most important question is how many clusters does your dataset have? When the feature
vectors are one-, two- or three-dimensional, you can look at the data and see “clouds” of
points in the feature space. Each cloud is a potential cluster. However, for D-dimensional
data, with D > 3, looking at the data is problematic.
One way of determining the reasonable number of clusters is based on the concept of
prediction strength. The idea is to split the data into training and test set, similarly to how we
do in supervised learning. Once you have the training and test sets, $S_{tr}$ of size $N_{tr}$ and $S_{te}$ of size $N_{te}$ respectively, you fix $k$, the number of clusters, and run a clustering algorithm $C$ on sets $S_{tr}$ and $S_{te}$, obtaining the clustering results $C(S_{tr}, k)$ and $C(S_{te}, k)$.
Let $A$ be the clustering $C(S_{tr}, k)$ built using the training set. The clusters in $A$ can be seen as regions. If an example falls within one of those regions, then that example belongs to some specific cluster. For example, if we apply the k-means algorithm to some dataset, it results in a partition of the feature space into $k$ polygonal regions, as we saw in Figure 2.
Define the $N_{te} \times N_{te}$ co-membership matrix $D[A, S_{te}]$ as follows: $D[A, S_{te}](i, i') = 1$ if and only if examples $x_i$ and $x_{i'}$ from the test set belong to the same cluster according to the clustering $A$. Otherwise $D[A, S_{te}](i, i') = 0$.
Let's take a break and see what we have here. We have built, using the training set of examples, a clustering $A$ that has $k$ clusters. Then we have built the co-membership matrix that indicates whether two examples from the test set belong to the same cluster in $A$.
Intuitively, if the quantity $k$ is the reasonable number of clusters, then two examples that belong to the same cluster in clustering $C(S_{te}, k)$ will most likely belong to the same cluster in clustering $C(S_{tr}, k)$. On the other hand, if $k$ is not reasonable (too high or too low), then the training data-based and test data-based clusterings will likely be less consistent.
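Concretely (following Burkov, 2020), the prediction strength of the number of clusters $k$ can be written as

$$ps(k) = \min_{j = 1, \ldots, k} \; \frac{1}{|A_j|\,(|A_j| - 1)} \sum_{i \neq i' \in A_j} D[A, S_{te}](i, i'),$$

where $A = C(S_{tr}, k)$ and $A_j$ is the $j$-th cluster from the clustering $C(S_{te}, k)$.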
Another effective method to estimate the number of clusters is the gap statistic method.
Other, less automatic methods, which some analysts still use, include the elbow method
and the average silhouette method.
Experiments suggest that a reasonable number of clusters is the largest $k$ such that $ps(k)$ is above 0.8. You can see in Figure 5 examples of prediction strength for different values of $k$ for two-, three- and four-cluster data.
For non-deterministic clustering algorithms, such as k-means, which can generate different clusterings depending on the initial positions of centroids, it is recommended to do multiple runs of the clustering algorithm for the same $k$ and compute the average prediction strength $\overline{ps}(k)$ over multiple runs.
Implementing k-means clustering.
We observe that the Region column is categorical, so we can convert it into binary data using feature engineering.
Feature Engineering
When a product manager tells you “We need to be able to predict whether a particular
customer will stay with us. Here are the logs of customers’ interactions with our product for
five years.” you cannot just grab the data, load it into a library and get a prediction. You
need to build a dataset first.
Remember from the first chapter that the dataset is the collection of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$.
The problem of transforming raw data into a dataset is called feature engineering. For
most practical problems, feature engineering is a labour-intensive process that demands from
the data analyst a lot of creativity and, preferably, domain knowledge.
For example, to transform the logs of user interaction with a computer system, one could
create features that contain information about the user and various statistics extracted from
the logs. For each user, one feature would contain the price of the subscription; other features
would contain the frequency of connections per day, week and year. Another feature would
contain the average session duration in seconds or the average response time for one request,
and so on. Everything measurable can be used as a feature. The role of the data analyst is to create informative features: those that allow the learning algorithm to build a model that predicts well the labels of the data used for training. Highly informative features are also called
features with high predictive power. For example, the average duration of a user’s session
has high predictive power for the problem of predicting whether the user will keep using the
application in the future.
We say that a model has a low bias when it predicts the training data well. That is, the
model makes few mistakes when we use it to predict labels of the examples used to build the
model.
5.1.1 One-Hot Encoding
Some learning algorithms only work with numerical feature vectors. When some feature in
your dataset is categorical, like “colors” or “days of the week,” you can transform such a
categorical feature into several binary ones.
If your example has a categorical feature “colors” and this feature has three possible values:
“red,” “yellow,” “green,” you can transform this feature into a vector of three numerical
values:
red = [1, 0, 0]
yellow = [0, 1, 0]
green = [0, 0, 1]
By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality, because that would imply that there's an order among the values in this category and that this specific order is important for the decision making. If the order of a feature's values is not important, using ordered numbers as values is likely to confuse the learning algorithm, because the algorithm will try to find a regularity where there is none, which may potentially lead to overfitting. (Burkov, 2020)
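A minimal sketch of this encoding with pandas (the project notebook itself applies scikit-learn's OneHotEncoder to the region column later on):

import pandas as pd

colors = pd.DataFrame({'color': ['red', 'yellow', 'green']})
# get_dummies expands the categorical column into one binary column per value
print(pd.get_dummies(colors, columns=['color']))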
South-West has the highest count of locations.
2) How much energy is each region generating?
North-East generates the highest energy.
3) Finding the cheapest prices for each region:
South-West has the cheapest locations.
Now we create a data frame for each region.
Scenario 1
We want to pick locations with cheap prices and optimal energy that can together generate 2 million kWh/a; our budget is 2 million €.
Total energy = Σ (energy per m² × area) over the four regions
= (238.62 × 3000) + (236.11 × 2000) + (242.04 × 3000) + (239.78 × 2000)
= 2,393,760 kWh/a > 2 million.
Total cost = Σ ((land price + 100 € material) × area) over the four regions
= 1,645,000 € < 2 million €.
So, if we use all the area available to us and build the plants, we can generate more than 2 million kWh/a on a budget of about 1.6 million €.
But we can optimize further by using only half of the land in the region where the price is highest.
We can generate more than 2 million kWh/a with a minimum budget of 1,384,000 €.
Locations for Scenario 1:
One could also use this method for scenario 2, where one must take locations in the regions where the most energy is generated. But for scenario 2 we will try the method of hierarchical clustering.
Scenario 2
Hierarchical clustering:
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two categories:
Agglomerative: This is a "bottom-up" approach: Each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: All observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.
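A minimal sketch of agglomerative clustering and its dendrogram with SciPy (the notebook uses scikit-learn's AgglomerativeClustering instead; the toy price/energy points below are made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# toy (price, energy) points; each observation starts in its own cluster
X = np.array([[57.0, 238.6], [65.0, 242.0], [74.0, 239.8],
              [137.0, 247.6], [149.0, 243.1]])
Z = linkage(X, method='ward')  # bottom-up ("agglomerative") merges, Ward linkage
dendrogram(Z)                  # visualise the hierarchy of merges
plt.show()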
After implementing the same method for all region datasets, we get the following results.
Observations:
1) From the above results, we want to find points that generate the highest energy at a cheap price.
2) First, we choose the optimal point for each region; we can see that the second-highest energy point is cheap compared to the highest point.
But from the calculation we can see that this does not fulfil the demand of 3 million kWh/a. Now we will choose the points that have the highest energy values.
Even when using the highest-energy points, we are not fulfilling the energy demand. Scenario 2 is not feasible.
Code:
Project to find the best location for solar farms that can fulfil our energy requirement.
In [476… # import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [477… #import Datasets
df = pd.read_excel(r'C:\Users\SHRUTEJ\Desktop\AI Project\Installed Solar Plants.xls')
df_1 = pd.read_excel(r'C:\Users\SHRUTEJ\Desktop\AI Project\Environment Solar Data.xls')
In [478… df.head()
Out[478]: Model ID Sunshine Hours per year Size Solar Panel m2 Generated energy kWh/a
0 1 1418 794 233616
1 2 1474 1726 525410
2 3 1335 5776 1612292
3 4 1224 6494 1681651
4 5 1320 2085 576313
In [479… df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Model ID 19 non-null int64
1 Sunshine Hours per year 19 non-null int64
2 Size Solar Panel m2 19 non-null int64
3 Generated energy kWh/a 19 non-null int64
dtypes: int64(4)
memory usage: 736.0 bytes
Observation: We want to find the energy per m^2, which we can compute by dividing Generated energy by Solar panel m2.
In [480… df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']
In [481… df.head()
Out[481]:
   Model ID  Sunshine Hours per year  Size Solar Panel m2  Generated energy kWh/a  energy_per_m2
0         1                     1418                  794                  233616     294.226700
1         2                     1474                 1726                  525410     304.409038
2         3                     1335                 5776                 1612292     279.136427
3         4                     1224                 6494                 1681651     258.954573
4         5                     1320                 2085                  576313     276.409113
EDA of installed plant dataset.
In [482… sns.pairplot(df)
Out[482]: <seaborn.axisgrid.PairGrid at 0x21012d5f6d0>
Observation :
1) Energy per m^2 is directly proportional to Sunshine Hours per year.
In [483… # Let's take a closer look with a jointplot.
sns.jointplot(x='Sunshine Hours per year',y='energy_per_m2',data = df)
Out[483]: <seaborn.axisgrid.JointGrid at 0x21012d5fa30>
Observation :
From the straight line we can think about implementing a linear regression model.
In [484… #Let's see how much it is correlating.
#we find the correlation and plot it with a heatmap.
sns.heatmap(df.corr(),annot = True)
Out[484]: <AxesSubplot:>
Model Implementation:
In [485… # creating Train and Test data:
X = df[[ 'Sunshine Hours per year']]
y = df['energy_per_m2']
In [486… # make a split in the dataset, import the Linear Regression model and fit it on the training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_sta
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Out[486]: LinearRegression()
Result of Regression
In [487… # intercept is the value of c in y = mx + c
print(lm.intercept_)
36.40591577945423
In [488… # coefficient is the slope of the line:
lm.coef_
Out[488]: array([0.18182098])
In [489… #Let's check predicted values
predictions = lm.predict(X_test)
predictions
Out[489]: array([258.95479913, 250.04557096, 304.41004491, 279.13692826,
294.22806986, 248.22736112, 260.40936699, 255.13655848])
In [490… #We can check how it varies from the actual values by plotting a scatter plot.
plt.scatter(y_test,predictions)
Out[490]: <matplotlib.collections.PathCollection at 0x21015f39ee0>
Observation:
From the graph we can see there is almost no deviation between y_test and the predictions.
In [491… from sklearn import metrics
metrics.mean_absolute_error(y_test,predictions)
Out[491]: 0.00046171013969242836
In [492… from sklearn.metrics import r2_score
r2_score(y_test, predictions)
Out[492]: 0.9999999989429613
From the result we can see that there is almost zero error in our model, and the R² score is also nearly 1, which is the best possible outcome.
Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices
Date
2020-12-01 59 59 59 59 59
2021-01-01 59 59 59 59 59
2021-02-01 59 59 59 59 59
2021-03-01 59 59 59 59 59
2021-04-01 59 59 59 59 59
2021-05-01 59 59 59 59 59
2021-06-01 59 59 59 59 59
2021-07-01 59 59 59 59 59
2021-08-01 59 59 59 59 59
2021-09-01 59 59 59 59 59
2021-10-01 59 59 59 59 58
2021-11-01 59 59 59 59 59
2021-12-01 59 59 59 59 59
2022-01-01 59 59 59 59 59
2022-02-01 59 59 59 59 59
2022-03-01 59 59 59 59 59
2022-04-01 59 59 59 59 59
2022-05-01 59 59 59 59 59
2022-06-01 59 59 59 59 59
2022-07-01 59 59 59 59 59
2022-08-01 59 59 59 59 59
We can see from the above two results that data for 59 unique properties is available to us. Also, from the pair plot, we can see the properties are divided into 4 clusters.
The dataset covers 56 months, so we have to convert it to a yearly format. Price and wind speed remain the same throughout the time period for each property. We will create a dataset of the 59 properties.
In [498… # finding unique properties.
locations_ = df_1[['Longitude','Latitude']].drop_duplicates()
locations_
Now we have a dataset of 59 properties, but we do not know in which region/area they are.
In [507… # we plot longitude vs latitude so we can see the properties.
Reg = np.array(locations_[['Longitude','Latitude']])
plt.scatter(Reg[:,0],Reg[:,1])
Out[507]: <matplotlib.collections.PathCollection at 0x2101ea7df10>
We can see there are 4 regions. We do not have any labels available, so we have to use unsupervised learning to predict the clusters.
We will use the K-means algorithm for clustering
In [508… from sklearn.cluster import KMeans
Kmeans = KMeans(n_clusters = 4)
Kmeans.fit(Reg)
Out[508]: KMeans(n_clusters=4)
In [509… Reg_list = Kmeans.labels_
Reg_list
Out[509]: array([3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 0,
0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2,
2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2])
In [510… #from these we can define which regions are which.
Kmeans.cluster_centers_
Out[510]: array([[0.2415 , 0.22814286],
[0.71328571, 0.70671429],
[0.73646154, 0.24076923],
[0.19594444, 0.72355556]])
In [511… fig,(ax1) = plt.subplots(1, sharey=True, figsize = (5,5))
ax1.set_title('Separate Regions using Kmeans')
ax1.scatter(Reg[:,0],Reg[:,1],c= Kmeans.labels_,cmap='rainbow')
Out[511]: <matplotlib.collections.PathCollection at 0x2101eacacd0>
In [512… locations_['region'] = Reg_list
In [513… locations_.head()
Out[513]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region
50 0.014 0.729 96.785357 3.608679 137.0 3
51 0.025 0.472 92.680357 3.552107 57.0 3
48 0.064 0.835 96.747500 3.554196 87.0 3
44 0.081 0.805 94.259643 3.617357 98.0 3
45 0.124 0.413 93.306786 3.485571 73.0 0
In [514… #defining Region using One-Hot Encoding.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(locations_[['region']]).toarray())
locations_1 = locations_.join(encoder_df)
locations_1.columns = ['Longitude','Latitude','Avg Sunhine Hours','Avg. Wind Speed','Avg prices','region','NE','NW','SW','SE']
locations_1.head()
Out[514]:
    Longitude  Latitude  Avg Sunhine Hours  Avg. Wind Speed  Avg prices  region   NE   NW   SW   SE
50      0.014     0.729          96.785357         3.608679       137.0       3  0.0  1.0  0.0  0.0
51      0.025     0.472          92.680357         3.552107        57.0       3  0.0  0.0  1.0  0.0
48      0.064     0.835          96.747500         3.554196        87.0       3  0.0  1.0  0.0  0.0
44      0.081     0.805          94.259643         3.617357        98.0       3  0.0  0.0  1.0  0.0
45      0.124     0.413          93.306786         3.485571        73.0       0  0.0  1.0  0.0  0.0
In [515… #defining Region by lambda function:
locations_['Area']= locations_['region'].apply(lambda region:"South-West" if region
In [516… locations_.head()
Out[516]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Area
50 0.014 0.729 96.785357 3.608679 137.0 3 South-West
51 0.025 0.472 92.680357 3.552107 57.0 3 South-West
48 0.064 0.835 96.747500 3.554196 87.0 3 South-West
44 0.081 0.805 94.259643 3.617357 98.0 3 South-West
45 0.124 0.413 93.306786 3.485571 73.0 0 South-East
We want sunshine hours on a yearly basis; that's why we will multiply by 12
In [517… locations_['Avg Sunshine hours'] = locations_['Avg Sunshine hours']*12
Now predicting energy per m^2 using the coefficient and intercept of the Linear Regression
In [518… locations_['Pridected Energy'] = lm.coef_[0]*locations_['Avg Sunshine hours'] + lm.intercept_
locations_.head()
Out[518]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
50      0.014     0.729         1161.424286        3.608679      137.0       3  South-West        247.577221
51      0.025     0.472         1112.164286        3.552107       57.0       3  South-West        238.620720
48      0.064     0.835         1160.970000        3.554196       87.0       3  South-West        247.494623
44      0.081     0.805         1131.115714        3.617357       98.0       3  South-West        242.066487
45      0.124     0.413         1119.681429        3.485571       73.0       0  South-East        239.987494
Now we will group data by Region and do the analysis.
In [519… sns.countplot(x='Area',data = locations_)
Out[519]: <AxesSubplot:xlabel='Area', ylabel='count'>
In [520… locations_.groupby(['Area']).count()
Out[520]:
            Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Pridected Energy
Area
North-East         13        13                  13              13         13      13                13
North-West         14        14                  14              14         14      14                14
In [524… sns.barplot(x = 'Area', y = 'Avg Price', data = locations_, estimator = min)
Out[524]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
In [525… locations_.groupby(['Area'], sort=False)['Avg Price'].min()
Out[525]:
Area
South-West    57.0
South-East    65.0
North-West    74.0
North-East    61.0
Name: Avg Price, dtype: float64
In [526… NW_ = locations_[locations_['Area'] == 'North-West']
NE_ = locations_[locations_['Area'] == 'North-East']
SE_ = locations_[locations_['Area'] == 'South-East']
SW_ = locations_[locations_['Area'] == 'South-West']
For scenario 1 we will see how we can optimize cost
and energy
In [527… NW_
Out[527]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
57      0.567     0.727         1118.584286        3.624429       74.0       1  North-West        239.788010
41      0.617     0.514         1136.922857        3.635679      149.0       1  North-West        243.122347
22      0.630     0.690         1282.220357        3.204429      102.0       1  North-West        269.540482
21      0.640     0.680         1286.622321        3.232714      318.0       1  North-West        270.340851
23      0.680     0.700         1304.996786        3.212714      131.0       1  North-West        273.681714
29      0.700     0.740         1282.432500        3.235286      248.0       1  North-West        269.579054
20      0.720     0.700         1136.828571        3.706714      128.0       1  North-West        243.105204
28      0.720     0.760         1281.752679        3.206571      320.0       1  North-West        269.455448
27      0.740     0.770         1265.851607        3.174714      232.0       1  North-West        266.564299
26      0.760     0.720         1266.314464       20.975714      245.0       1  North-West        266.648457
25      0.770     0.750         1272.837857        3.321429      129.0       1  North-West        267.834546
24      0.770     0.800         1270.918929        3.137429      330.0       1  North-West        267.485645
53      0.808     0.505         1146.295714        3.644357      108.0       1  North-West        244.826530
43      0.864     0.838         1087.371429        3.610607       82.0       1  North-West        234.112858
In [528… # we will sort the price and energy columns and choose the row where the price is lowest
NW_sorted = NW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, False])
best_row_nw = NW_sorted.head(1)
best_row_nw
Out[528]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
57      0.567     0.727         1118.584286        3.624429       74.0       1  North-West         239.78801
In [529… SE_
Out[529]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
45      0.124     0.413         1119.681429        3.485571       73.0       0  South-East        239.987494
49      0.160     0.173         1127.485714        3.630857       96.0       0  South-East        241.406477
46      0.164     0.106         1145.837143        3.708321       90.0       0  South-East        244.743152
4       0.180     0.300         1160.554286        3.676500      126.0       0  South-East        247.419037
7       0.200     0.200         1121.288571        3.597750       95.0       0  South-East        240.279706
9       0.210     0.210         1130.082857        3.478821      127.0       0  South-East        241.878692
0       0.220     0.270         1156.928571        3.680839       67.0       0  South-East        246.759806
8       0.230     0.250         1291.221964        3.360143      326.0       0  South-East        271.177163
1       0.280     0.230         1131.634286        3.540536      116.0       0  South-East        242.160774
2       0.280     0.270         1139.567143        3.565125       91.0       0  South-East        243.603134
6       0.300     0.250         1163.729455        3.511929      130.0       0  South-East        247.996349
5       0.310     0.320         1180.821429        3.593732       86.0       0  South-East        251.104029
3       0.350     0.200         1130.978571        3.615589       65.0       0  South-East        242.041552
42      0.373     0.002         1118.151429        3.567857      107.0       0  South-East        239.709308
In [530… SE_sorted = SE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, False])
best_row_se = SE_sorted.head(1)
best_row_se
Out[530]:
   Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
3       0.35       0.2         1130.978571        3.615589       65.0       0  South-East        242.041552
In [531… SW_
Out[531]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
50      0.014     0.729         1161.424286        3.608679      137.0       3  South-West        247.577221
51      0.025     0.472         1112.164286        3.552107       57.0       3  South-West        238.620720
48      0.064     0.835         1160.970000        3.554196       87.0       3  South-West        247.494623
44      0.081     0.805         1131.115714        3.617357       98.0       3  South-West        242.066487
47      0.137     0.710         1106.447143        3.597911       72.0       3  South-West        237.581223
58      0.180     0.611         1135.157143        3.682125      115.0       3  South-West        242.801303
14      0.180     0.800         1255.543393        3.243714      312.0       3  South-West        264.690050
17      0.200     0.770         1276.420179        3.097286      159.0       3  South-West        268.485888
19      0.210     0.740         1127.112857        3.607393       73.0       3  South-West        241.338684
10      0.220     0.700         1303.584107        3.196000      276.0       3  South-West        273.424860
18      0.230     0.760         1266.811071        3.140571      137.0       3  South-West        266.738750
56      0.233     0.623         1146.467143        3.668304      112.0       3  South-West        244.857699
40      0.233     0.929         1144.234286        3.546321       93.0       3  South-West        244.451719
11      0.280     0.680         1292.041607        3.168286      105.0       3  South-West        271.326191
12      0.280     0.690         1286.694643        3.327143      224.0       3  South-West        270.354001
16      0.300     0.720         1269.559286        3.245143      174.0       3  South-West        267.238433
15      0.310     0.750         1285.228929        3.214571      304.0       3  South-West        270.087503
13      0.350     0.700         1261.748571        3.179000      273.0       3  South-West        265.818281
In [532… SW_sorted = SW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, False])
best_row_sw = SW_sorted.head(1)
best_row_sw
Out[532]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
51      0.025     0.472         1112.164286        3.552107       57.0       3  South-West         238.62072
In [533… NE_
Out[533]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
32      0.630     0.280         1336.118571        3.604982      149.0       2  North-East        279.340308
31      0.640     0.260         1147.204286        3.812143      139.0       2  North-East        244.991727
33      0.680     0.250         1114.937143        3.448929      137.0       2  North-East        239.124883
39      0.700     0.210         1112.207143        3.578625       93.0       2  North-East        238.628512
52      0.715     0.211         1167.107143        3.673768      139.0       2  North-East        248.610484
30      0.720     0.220         1315.660909        3.258857      277.0       2  North-East        275.620676
38      0.720     0.230         1144.800000        3.606429      134.0       2  North-East        244.554577
37      0.740     0.200         1125.844286        3.495375       74.0       2  North-East        241.108031
36      0.760     0.300         1148.657143        3.705429      117.0       2  North-East        245.255887
34      0.770     0.180         1098.385714        3.697875       61.0       2  North-East        236.115486
35      0.770     0.310         1158.064286        3.590196      150.0       2  North-East        246.966303
55      0.856     0.100         1108.628571        3.583125       99.0       2  North-East        237.977853
54      0.873     0.379         1158.085714        3.588218       92.0       2  North-East        246.970199
In [534… NE_sorted = NE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, False])
best_row_ne = NE_sorted.head(1)
best_row_ne
Out[534]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
34       0.77      0.18         1098.385714        3.697875       61.0       2  North-East        236.115486
In [535… Scenari0_1 = pd.concat([best_row_nw, best_row_se, best_row_sw, best_row_ne], axis=0)
In [536… Scenari0_1
Out[536]:
    Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region        Area  Pridected Energy
57      0.567     0.727         1118.584286        3.624429       74.0       1  North-West        239.788010
3       0.350     0.200         1130.978571        3.615589       65.0       0  South-East        242.041552
51      0.025     0.472         1112.164286        3.552107       57.0       3  South-West        238.620720
34      0.770     0.180         1098.385714        3.697875       61.0       2  North-East        236.115486
In [537… # Calculate Energy (North-West uses only half its area: 1500 m^2)
Total_Energy = (238.620720 * 3000) + (236.115486 * 2000) + (242.041552 * 2000) + (239.788010 * 1500)
In [538… Total_Energy = round(Total_Energy)
Total_Energy
Out[538]: 2031858
In [539… #Cost
Total_cost = round((3000 * 157) + (2000 * 161) + (2000 * 165) + (1500 * 174))
Total_cost
Out[539]: 1384000
In [540… print(" Total Energy generated in kWh/a:", Total_Energy)
print(" Total cost in Euro:", Total_cost)
print(" total area we are occupying: 8500 m^2")
Total Energy generated in kWh/a: 2031858
Total cost in Euro: 1384000
total area we are occupying: 8500 m^2
Scenario 2
We can also use the above method to solve scenario 2, but we will try it using hierarchical clustering:
In [541… from sklearn.cluster import AgglomerativeClustering
# Extract the two columns of features that you want to use for clustering
NW_copmare = NW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_NW.fit(NW_copmare)
# Predict the clusters for each data point
pred = cluster_NW.fit_predict(NW_copmare)
# Create a scatter plot of the clusters
plt.scatter(NW_copmare['Avg Price'], NW_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [542… pred
Out[542]: array([3, 0, 1, 2, 1, 4, 0, 2, 4, 4, 1, 2, 0, 3], dtype=int64)
In [543… SE_copmare = SE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_SE.fit(SE_copmare)
# Predict the clusters for each data point
pred = cluster_SE.fit_predict(SE_copmare)
# Create a scatter plot of the clusters
plt.scatter(SE_copmare['Avg Price'], SE_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [544… SW_copmare = SW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_SW.fit(SW_copmare)
# Predict the clusters for each data point
pred = cluster_SW.fit_predict(SW_copmare)
# Create a scatter plot of the clusters
plt.scatter(SW_copmare['Avg Price'], SW_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [545… NE_copmare = NE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_NE.fit(NE_copmare)
# Predict the clusters for each data point
pred = cluster_NE.fit_predict(NE_copmare)
# Create a scatter plot of the clusters
plt.scatter(NE_copmare['Avg Price'], NE_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
We will use all the optimal points, which have the best energy at a cheap price.
In [546… # SW + NW + NE + SE
Energy_op = (252*2000) + (272 * 3000) + (273 * 3000) + (280 *2000)
Energy_op
Out[546]: 2699000
Now we will use the points which have the highest energy
In [547… Energy_hi = (271*2000) + (273.5 * 3000) + (273 * 3000) + (280 *2000)
Energy_hi
Out[547]: 2741500.0
In [548… cost = (376*3000) + (249*2000) + (231 * 3000) + (426*2000)
cost
Out[548]: 3171000
Conclusion
Scenario 1
We can fulfil the demand of 2 million kWh/a easily, and we don't even need the full 2 million € budget.
We are also not using all of the proposed area: in "North-West", the region whose cheapest location has the highest price, we use only half the land, which saves money.
Scenario 2
First we used the points with the second-highest energy, which are cheap compared to the highest-energy points, but they do not fulfil the energy demand.
Then we took the points with the highest energy without considering cost; even then the demand is not fulfilled. We would need more land and a little more money.
Bibliography
Burkov, A. (2020). The Hundred-Page Machine Learning Book.