Project Report on
PREDICTION OF BEST LOCATION FOR SOLAR
FARM IN ORDER TO MEET ENERGY DEMAND
AND COMPANY PROFIT.
Submitted by
SHRUTEJ JARIWALA
PARSHWA BHAVSAR
VIRAL SUREJA
VISHNUVARDHAN CHOWDARY
SUBJECT – AI BASICS
PROF: JEAN-MICHEL TAVERNE
1. Problem Summary
Find a location for a solar farm that can fulfil the customer's energy demand:
1) Scenario 1: 2 million kWh/a
2) Scenario 2: 3 million kWh/a
Limitations:
The building space in the regions is limited, so plants may be built on at most the following area in each region:
-North-West: 3,000 m2
-North-East: 3,000 m2
-South-West: 2,000 m2
-South-East: 2,000 m2
For one square metre of solar plant, Smart Energy LLC has to pay 100 € for the material plus the cost of the land. A budget of 2 million € can be invested to fulfil scenario 1, and a budget of 3 million € to fulfil scenario 2.
Objective
We have to find the best solar farm location that fulfils the consumer's energy demand while also ensuring the company's profit. To do that, we apply machine learning techniques to solve this problem.
1.1. What is Machine Learning?
Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by
1) gathering a dataset,
2) algorithmically building a statistical model based on that dataset.
That statistical model is assumed to be used somehow to solve the practical problem.
To save keystrokes, I use the terms “learning” and “machine learning” interchangeably.
(Burkov, 2020)
Types of learning can be supervised, semi-supervised, unsupervised and reinforcement.
1.2. Supervised Learning
In supervised learning, the dataset is the collection of labeled examples {(xi, yi)}, i = 1, . . ., N. Each element xi among the N examples is called a feature vector. A feature vector is a vector in which each dimension j = 1, . . ., D contains a value that describes the example somehow. That value is called a feature and is denoted as x(j). For instance, if each example x in our collection represents a person, then the first feature, x(1), could contain height in cm, the second feature, x(2), could contain weight in kg, x(3) could contain gender, and so on. For all examples in the dataset, the feature at position j in the feature vector always contains the same kind of information. It means that if the feature x(2) contains weight in kg in some example xi, then x(2) will also contain weight in kg in every example xk, k = 1, . . ., N. The label yi can be either an element belonging to a finite set of classes {1, 2, . . ., C}, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, yi is either one of a finite set of classes or a real number. You can see a class as a category to which an example belongs.
For instance, if your examples are email messages and your problem is spam
detection, then you have two classes {spam, not spam}. The goal of a supervised
learning algorithm is to use the dataset to produce a model that takes a feature vector x
as input and outputs information that allows deducing the label for this feature vector.
For instance, the model created using the dataset of people could take as input a
feature vector describing a person and output a probability that the person
has cancer.
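As a minimal illustration of these definitions (generic toy values, not the project data; scikit-learn is used here since it is already a project requirement):

from sklearn.linear_model import LogisticRegression

# Toy labeled dataset: each row is a feature vector (height in cm, weight in kg),
# each label marks the class the person belongs to.
X = [[170, 65], [180, 90], [160, 55], [175, 80]]
y = [0, 1, 0, 1]

model = LogisticRegression().fit(X, y)    # learn a model from the labeled examples
print(model.predict([[172, 70]]))         # deduce the label for a new feature vector
print(model.predict_proba([[172, 70]]))   # or output a probability, as described above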
1.3. Unsupervised Learning
In unsupervised learning, the dataset is a collection of unlabelled examples {xi}, i = 1, . . ., N.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to
create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For
example, in clustering, the model returns the id of the cluster for each feature vector
in the dataset. In dimensionality reduction, the output of the model is a feature vector
that has fewer features than the input x; in outlier detection, the output is a real
number that indicates how x is different from a “typical” example in the dataset.
1.4 Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine “lives”
in an environment and is capable of perceiving the state of that environment as a
vector of features. The machine can execute actions in every state. Different actions
bring different rewards and could also move the machine to another state of the
environment.
2. Datasets:
 Installed Solar plants
 New locations data sets.
We have two datasets. The first one, “Installed Solar Plants”, contains data on 20 already installed power plants and gives each plant's sunshine hours per year, solar panel size in m^2 and generated energy in kWh/a.
The second dataset contains 56 months of data, from January 2018 to August 2022, for 59 new unique locations, and gives sunshine hours, price per m^2 and average wind speed.
3. Requirements:
 Python
 IDE: Jupyter Notebook
 Libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn
3.1. Libraries:
 import pandas as pd - pandas is a popular Python-based data analysis toolkit which
can be imported using import pandas as pd. It presents a diverse range of utilities,
ranging from parsing multiple file formats to converting an entire data table into a
NumPy matrix array. This makes pandas a trusted ally in data science and machine
learning. Similar to NumPy, pandas deal primarily with data in 1-D and 2-D arrays;
however, pandas handle the two differently
 import matplotlib.pyplot as plt - matplotlib.pyplot is stateful, in that it keeps track
of the current figure and plotting area, and the plotting functions are directed to the
current axes; it can be imported using import matplotlib.pyplot as plt.
 import seaborn as sns - Seaborn is a library for making statistical graphics in Python.
It builds on top of matplotlib and integrates closely with pandas’ data structures.
Seaborn helps you explore and understand your data. Its plotting functions operate on
data frames and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its dataset-
oriented, declarative API lets you focus on what the different elements of your plots
mean, rather than on the details of how to draw them
 import numpy as np - NumPy provides a large set of numeric datatypes that you can
use to construct arrays. NumPy tries to guess a datatype when you create an array, but
functions that construct arrays usually also include an optional argument to explicitly
specify the datatype, as sketched below.
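For example (a generic illustration, not taken from the project code):

import numpy as np

a = np.array([1, 2, 3])                     # NumPy guesses the datatype
b = np.array([1, 2, 3], dtype=np.float64)   # datatype specified explicitly
print(a.dtype, b.dtype)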
4. Initial thoughts and Observation
 We can observe that both datasets have one column in common: sunshine hours.
 For every region we have a limited area in m^2; if we can find out how much energy is generated per m^2 in each region, we can work out how much energy can be generated in total.
 We also want to find out whether there is any correlation between sunshine hours and energy per m^2.
5. Solution process:
6.1 Importing data:
We import the data using the pandas library.
We divided the generated energy by the solar panel area in m^2, which gives the energy per square metre.
6.2 Data exploration: installed plants data
Using the seaborn library, we plotted a pair plot of the data:
df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']
sns.pairplot(df)
We try to find out how strongly the sunshine hours per year column is correlated with the energy per m^2 column.
Observation:
1) Energy per m^2 is directly proportional to sunshine hours per year.
2) Solar panel m^2 is directly proportional to sunshine hours per year.
#Let's see how much is it corelating..
#we find corelation and plot it with heatmap.
sns.heatmap(df.corr(),annot = True)
Observation:
1) Energy per m^2 is completely determined by sunshine hours per year.
2) We can predict energy per m^2 from sunshine hours per year, so we can apply a regression model.
6.3 Classification vs. Regression
Classification is a problem of automatically assigning a label to an unlabelled example.
Spam detection is a famous example of classification. In machine learning, the
classification problem is solved by a classification learning algorithm that takes a
collection of labelled examples as inputs and produces a model that can take an
unlabelled example as input and either directly output a label or output a number that
can be used by the analyst to deduce the label. An example of such a number is a
probability.
In a classification problem, a label is a member of a finite set of classes. If the size of
the set of classes is two (“sick”/ “healthy”, “spam”/“not spam”), we talk about binary
classification (also called binomial in some sources). Multiclass classification (also
called multinomial) is a classification problem with three or more classes. While some
learning algorithms naturally allow for more than two classes, others are by nature
binary classification algorithms. There are strategies allowing to turn a binary
classification learning algorithm into a multiclass one.
Regression is a problem of predicting a real-valued label (often called a target) given
an unlabelled example. Estimating house price valuation based on house features, such
as area, the number of bedrooms, location and so on is a famous example of regression.
The regression problem is solved by a regression learning algorithm that takes a
collection of labelled examples as inputs and produces a model that can take an
unlabelled example as input and output a target. (Burkov, 2020)
6.4 Linear Regression
Linear regression is a popular regression learning algorithm that learns a model which
is a linear combination of features of the input example.
6.4.1 Problem Statement
We have a collection of labeled examples {(xi, yi)}, i = 1, . . ., N, where N is the size of the collection, xi is the D-dimensional feature vector of example i = 1, . . ., N, yi is a real-valued target and every feature x(j), j = 1, . . ., D, is also a real number. We want to build a model fw,b(x) as a linear combination of the features of example x:
fw,b(x) = wx + b,
where w is a D-dimensional vector of parameters and b is a real number. The notation
fw,b means that the model f is parametrized by two values: w and b. We will use the
model to predict the unknown y for a given x like this: y ← fw,b(x). Two models
parametrized by two different pairs (w, b) will likely produce two different predictions
when applied to the same example. We want to find the optimal values (w∗ , b∗ ).
Obviously, the optimal values of parameters define the model that makes the most
accurate predictions. You could have noticed that the form of our linear model in eq. 1
is very similar to the form of the SVM model. The only difference is the missing sign
operator. The two models are indeed similar. However, the hyperplane in the SVM plays
the role of the decision boundary: it’s used to separate two groups of examples from one
another. As such, it has to be as far from each group as possible. On the other hand, the
hyperplane in linear regression is chosen to be as close to all training examples as
possible. You can see why this latter requirement is essential by looking at the
illustration in Figure 1. It displays the regression line (in red) for one-dimensional
examples (blue dots). We can use this line to predict the value of the target ynew for a
new unlabelled input example xnew. If our examples are D-dimensional feature vectors
(for D > 1), the only difference with the one-dimensional case is that the regression
model is not a line but a plane or a hyperplane (for D > 2 ).
Now you see why it’s essential to have the requirement that the regression
hyperplane lies as close to the training examples as possible: if the red line in
Figure.1 was far from the blue dots, the prediction ynew would have fewer chances
to be correct.
6.4.2 Solution
To get this latter requirement satisfied, the optimization procedure which we use to
find the optimal values for w∗ and b∗ tries to minimize the following expression:
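The expression minimized here (reconstructed from the loss and cost definitions given below) is the average squared error over the training examples:
(1/N) Σ i=1..N (fw,b(xi) − yi)^2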
In mathematics, the expression we minimize or maximize is called an objective
function, or, simply, an objective. The expression (fw,b(xi) − yi)^2 in the above
objective is called the loss function.
It is a measure of the penalty for misclassifying example i. This particular choice of
the loss function is called squared error loss. All model-based learning algorithms
have a loss function and what we do to find the best model is we try to minimize the
objective known as the cost function. In linear regression, the cost function is given
by the average loss, also called the empirical risk. The average loss, or empirical
risk, for a model, is the average of all penalties obtained by applying the model to
the training data.
Why is the loss in linear regression a quadratic function? Why couldn’t we get the
absolute value of the difference between the true target yi and the predicted value f (xi)
and use that as a penalty? We could. Moreover, we also could use a cube instead of a
square.
We decided to use the linear combination of features to predict the target. However, we
could use a square or some other polynomial to combine the values of features. We
could also use some other loss function that makes sense: the absolute difference
between f(xi) and yi makes sense, and the cube of the difference too; the binary loss
(1 when f(xi) and yi are different and 0 when they are the same) also makes sense, right?
Sounds easy, doesn't it? However, do not rush to invent a new learning algorithm.
If we made different decisions about the form of the model, the form of the loss
function, and about the choice of the algorithm that minimizes the average loss to find
the best values of parameters, we would end up inventing a different machine learning
algorithm. (Burkov, 2020)
Implementing Linear Regression:
We took the “Sunshine Hours” column as the feature and energy per m^2 as the label. Then we
split the data in a 60/40 ratio to create training and test sets, and imported the Linear
Regression model from the scikit-learn library. After fitting on the training data, we tested
the model on the remaining 40% of the data. With the predict function we predict the energy for
the test data and then compare the predicted values to the test labels; that is how we estimate
the error of our model. The algorithm also gives us the regression coefficient and intercept.
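A condensed sketch of these steps (the full notebook is reproduced in the code section; df and its columns are the installed-plants data from section 6.2):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

X = df[['Sunshine Hours per year']]     # feature
y = df['energy_per_m2']                 # label

# 60/40 split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

print(lm.coef_, lm.intercept_)                   # regression coefficient and intercept
print(mean_absolute_error(y_test, predictions))  # mean absolute error of the model
print(r2_score(y_test, predictions))             # R^2 score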
Observation:
From the scatter plot of predictions against test labels we can observe a straight line, which shows little deviation and high accuracy. Then we check the mean absolute error and the R^2 score.
6.5. Analysis of the second dataset: the locations dataset
Observation:
 In the first row we can see the scatter plot of longitude vs. latitude, which gives the locations of the properties. We can also observe four clusters of regions.
 When we group the data by Date and count the values, we find there are 59 unique locations, and for each location 56 months of data are given.
 We have the data in monthly form, so we have to convert it into a yearly format.
Steps:
1) First, we will create a dataset of the 59 locations.
2) We classify each location into a region.
3) We have to find the average sunshine hours, average price and average wind speed of each location on a yearly basis.
4) Then we predict the energy for each region.
Finding average Sunshine Hours:
For each location, we add the sunshine hours of all 56 months and divide by 56, which gives the average sunshine hours per month. Then we multiply by 12 to get the average value for one year, as sketched below.
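A sketch of this conversion, assuming df_1 is the locations dataset (the column name 'Sunhine Hours' is spelled as in the data file):

# average sunshine hours per month for each of the 59 locations,
# then multiplied by 12 to obtain an average yearly value
monthly_avg = df_1.groupby(['Longitude', 'Latitude'])['Sunhine Hours'].mean()
yearly_avg = monthly_avg * 12
print(yearly_avg.head())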
Predicting Regions:
We have only two features, Longitude and Latitude, and no labels; that is why we choose unsupervised learning to assign the regions.
We prefer the k-means clustering algorithm.
9.2 Clustering
Clustering is a problem of learning to assign a label to examples by leveraging an
unlabelled dataset. Because the dataset is completely unlabelled, deciding on
whether the learned model is optimal is much more complicated than in supervised
learning.
There is a variety of clustering algorithms, and, unfortunately, it’s hard to tell which one is
better in quality for your dataset. Usually, the performance of each algorithm depends on
the unknown properties of the probability distribution the dataset was drawn from. In this
Chapter, I outline the most useful and widely used clustering algorithms. (Burkov, 2020)
9.2.1 K-Means
The k-means clustering algorithm works as follows. First, you choose k — the number of
clusters. Then you randomly put k feature vectors, called centroids, to the feature space.
We then compute the distance from each example x to each centroid c using some metric,
like the Euclidean distance. Then we assign the closest centroid to each example (like if we
labelled each example with a centroid id as the label). For each centroid, we calculate the
average feature vector of the examples labelled with it. These average feature vectors become
the new locations of the centroids.
We recompute the distance from each example to each centroid, modify the assignment and
repeat the procedure until the assignments don’t change after the centroid locations were
recomputed. The model is the list of assignments of centroids IDs to the examples.
The initial position of centroids influences the final positions, so two runs of k-means can
result in two different models. Some variants of k-means compute the initial positions of
centroids based on some properties of the dataset.
One run of the k-means algorithm is illustrated in Figure 2. The circles in Figure 2 are
two-dimensional feature vectors; the squares are moving centroids. Different background
colours represent regions in which all points belong to the same cluster.
The value of k, the number of clusters, is a hyperparameter that has to be tuned by the
data analyst. There are some techniques for selecting k. None of them is proven optimal.
Most of those techniques require the analyst to make an “educated guess” by looking at some
metrics or by examining cluster assignments visually.
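A minimal sketch of this procedure with scikit-learn, applied to the two location features used in this project (k = 4, matching the four regions; locations_ is the data frame of the 59 locations built in the notebook):

from sklearn.cluster import KMeans

coords = locations_[['Longitude', 'Latitude']].values   # one 2-D feature vector per location

kmeans = KMeans(n_clusters=4)      # k = 4: we expect four regions
kmeans.fit(coords)

print(kmeans.labels_)              # cluster id assigned to each location
print(kmeans.cluster_centers_)     # final centroid positions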
9.2.3 Determining the Number of Clusters
The most important question is how many clusters does your dataset have? When the feature
vectors are one-, two- or three-dimensional, you can look at the data and see “clouds” of
points in the feature space. Each cloud is a potential cluster. However, for D-dimensional
data, with D > 3, looking at the data is problematic.
One way of determining the reasonable number of clusters is based on the concept of
prediction strength. The idea is to split the data into training and test set, similarly to how we
do in supervised learning. Once you have the training and test sets, Str of size Ntr and Ste of
size Nte respectively, you fix k, the number of clusters, and run a clustering algorithm C on sets
Str and Ste and obtain the clustering results C (Str, k) and C (Ste, k).
Let A be the clustering C (Str, k) built using the training set. The clusters in A can be seen as
regions. If an example falls within one of those regions, then that example belongs to
some specific cluster. For example, if we apply the k-means algorithm to some dataset, it
results in a partition of the feature space into k polygonal regions, as we saw in Figure 2.
Define the Nte × Nte co-membership matrix D[A, Ste] as follows: D[A, Ste](i,i′) = 1 if and only
if examples xi and xi′ from the test set belong to the same cluster according to the clustering
A. Otherwise D[A, Ste](i,i′) = 0.
Let’s take a break and see what we have here. We have built, using the training set of
examples, a clustering A that has k clusters. Then we have built the co-membership matrix
that indicates whether two examples from the test set belong to the same cluster in A.
Intuitively, if the quantity k is the reasonable number of clusters, then two examples that
belong to the same cluster in clustering C (Ste, k) will most likely belong to the same cluster
in clustering C (Str, k). On the other hand, if k is not reasonable (too high or too low), then
training data-based and test data-based clustering will likely be less consistent.
Another effective method to estimate the number of clusters is the gap statistic method.
Other, less automatic methods, which some analysts still use, include the elbow method
and the average silhouette method.
Experiments suggest that a reasonable number of clusters is the largest k such that ps(k) is
above 0.8. You can see in Figure 5 examples of predictive strength for different values of k for
two, three- and four-cluster data.
For non-deterministic clustering algorithms, such as k-means, which can generate different
clustering depending on the initial positions of centroids, it is recommended to do multiple runs
of the clustering algorithm for the same k and compute the average prediction strength ̄ps(k)
over multiple runs.
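The report does not implement prediction strength, but a sketch of the procedure described above could look as follows (prediction_strength and its arguments are our own illustrative names; scikit-learn's KMeans plays the role of the clustering algorithm C):

import numpy as np
from sklearn.cluster import KMeans

def prediction_strength(X_train, X_test, k):
    # Clustering C(Str, k) on the training set and C(Ste, k) on the test set.
    km_train = KMeans(n_clusters=k).fit(X_train)
    km_test = KMeans(n_clusters=k).fit(X_test)

    # Assign the test examples to the training-based clustering A = C(Str, k).
    labels_by_A = km_train.predict(X_test)
    test_labels = km_test.labels_

    # Co-membership matrix D[A, Ste]: 1 if two test examples fall into the same
    # cluster of the training-based clustering A, 0 otherwise.
    D = (labels_by_A[:, None] == labels_by_A[None, :]).astype(int)

    # For each test cluster, the fraction of point pairs that are also co-members
    # under A; ps(k) is the worst (smallest) of these fractions.
    strengths = []
    for j in range(k):
        idx = np.where(test_labels == j)[0]
        n_j = len(idx)
        if n_j < 2:
            continue
        pairs = D[np.ix_(idx, idx)]
        strengths.append((pairs.sum() - n_j) / (n_j * (n_j - 1)))  # exclude the diagonal
    return min(strengths) if strengths else 0.0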
Implementing k-means clustering.
We see that the Region column is a categorical column, and we can convert it into binary data by feature engineering.
Feature Engineering
When a product manager tells you “We need to be able to predict whether a particular
customer will stay with us. Here are the logs of customers’ interactions with our product for
five years.” you cannot just grab the data, load it into a library and get a prediction. You
need to build a dataset first.
Remember from the first chapter that the dataset is the collection of labeled examples
{(xi, yi)}, i = 1, . . ., N. Each element xi among N is called a feature vector. A feature vector is a
vector in which each dimension j = 1, . . ., D contains a value that describes the example
somehow. That value is called a feature and is denoted as x(j).
The problem of transforming raw data into a dataset is called feature engineering. For
most practical problems, feature engineering is a labour-intensive process that demands from
the data analyst a lot of creativity and, preferably, domain knowledge.
For example, to transform the logs of user interaction with a computer system, one could
create features that contain information about the user and various statistics extracted from
the logs. For each user, one feature would contain the price of the subscription; other features
would contain the frequency of connections per day, week and year. Another feature would
contain the average session duration in seconds or the average response time for one request,
and so on. Everything measurable can be used as a feature. The role of the data analyst is to
create informative features: those would allow the learning algorithm to build a model that
predicts well labels of the data used for training. Highly informative features are also called
features with high predictive power. For example, the average duration of a user’s session
has high predictive power for the problem of predicting whether the user will keep using the
application in the future.
We say that a model has a low bias when it predicts the training data well. That is, the
model makes few mistakes when we use it to predict labels of the examples used to build the
model.
5.1.1 One-Hot Encoding
Some learning algorithms only work with numerical feature vectors. When some feature in
your dataset is categorical, like “colors” or “days of the week,” you can transform such a
categorical feature into several binary ones.
If your example has a categorical feature “colors” and this feature has three possible values:
“red,” “yellow,” “green,” you can transform this feature into a vector of three numerical
values:
red = [1, 0, 0]
yellow = [0, 1, 0]
green = [0, 0, 1]
By doing so, you increase the dimensionality of your feature vectors. You should not transform
red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality because that
would imply that there’s an order among the values in this category and this specific order is
important for the decision making. If the order of a feature’s values is not important, using
ordered numbers as values is likely to confuse the learning algorithm, because the algorithm
will try to find a regularity where there is none, which may potentially lead to overfitting.
(Burkov, 2020)
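A small illustration of the colors example above (generic, not project code):

import pandas as pd

colors = pd.DataFrame({'color': ['red', 'yellow', 'green', 'red']})

# one binary column per category: red = [1, 0, 0], yellow = [0, 1, 0], green = [0, 0, 1]
one_hot = pd.get_dummies(colors['color'])
print(one_hot)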
Implementing one-hot encoding:
We can also assign the region names with a lambda function using if/else syntax.
Data analysis by regions:
1) Location counts by region:
 South-West has the highest number of locations.
2) How much energy each region can generate:
 North-East has the location with the highest predicted energy.
3) Finding the cheapest price in each region:
 South-West has the cheapest locations.
Now we create a data frame for each region.
Scenario 1
We want to pick locations which have cheap prices and the best energy, so that we can generate 2 million kWh/a of energy within our budget of 2 million €.
total energy = (energy per m^2 × m^2) summed over the four regions
= (238.62 × 3000) + (236.11 × 2000) + (242.04 × 3000) + (239.78 × 2000)
= 2,393,760 kWh/a > 2 million kWh/a
Total cost = (land price + 100 € material) × m^2 for each region
= 1,645,000 € < 2 million €
So, if we use all the area available to us and build the plants there, we can get more than 2 million kWh/a of energy for a budget of about 1.6 million €.
But we can optimize it further by using only half of the land in the region where the price is highest.
We can generate more than 2 million kWh/a with a minimum budget of 1,384,000 €.
Locations for Scenario 1:
One could also use this method for scenario 2, where one would take in each region the location that generates the highest energy.
But for scenario 2 we will try the method of hierarchical clustering.
Scenario 2
Hierarchical clustering:
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two categories:
 Agglomerative: This is a "bottom-up" approach: Each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
 Divisive: This is a "top-down" approach: All observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.
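A minimal sketch of agglomerative (bottom-up) clustering with scikit-learn on the price and predicted-energy columns used in this project (NW_ is the North-West data frame created above; the column name 'Pridected Energy' is spelled as in the notebook):

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

features = NW_[['Avg Price', 'Pridected Energy']]   # one region's locations

# bottom-up: every location starts as its own cluster and the closest
# clusters are merged until n_clusters remain
model = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = model.fit_predict(features)

plt.scatter(features['Avg Price'], features['Pridected Energy'], c=labels)
plt.xlabel('Avg Price')
plt.ylabel('Pridected Energy')
plt.show()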
Implementation of hierarchical clustering
After applying the same method to all the region data frames, we get the following results.
Observations:
1) From the above results we want to find the points that generate the highest energy at a cheap price.
2) First, we choose an optimal point for each region; we can see that the second-highest energy point is cheap compared to the highest-energy point.
But from the calculation we can see that this does not fulfil the demand of 3 million kWh/a.
Now we will choose the points that have the highest energy values.
Although we use the highest-energy points, we still do not fulfil the energy demand, so scenario 2 will not be feasible.
Conclusion
Code:
Project to find the best location for solar farms that can fulfil our energy requirement.
In [476… # import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [477… #import Datasets
df = pd.read_excel(r'C:\Users\SHRUTEJ\Desktop\AI Project\Installed Solar Plants.xls
df_1 = pd.read_excel(r'C:\Users\SHRUTEJ\Desktop\AI Project\Environment Solar Data.x
In [478… df.head()
Out[478]: Model ID Sunshine Hours per year Size Solar Panel m2 Generated energy kWh/a
0 1 1418 794 233616
1 2 1474 1726 525410
2 3 1335 5776 1612292
3 4 1224 6494 1681651
4 5 1320 2085 576313
In [479… df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Model ID 19 non-null int64
1 Sunshine Hours per year 19 non-null int64
2 Size Solar Panel m2 19 non-null int64
3 Generated energy kWh/a 19 non-null int64
dtypes: int64(4)
memory usage: 736.0 bytes
Observation: We want to find the energy per m^2, so we can compute it by dividing the generated energy by the solar panel m2.
In [480… df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2']
In [481… df.head()
Out[481]: Model ID  Sunshine Hours per year  Size Solar Panel m2  Generated energy kWh/a  energy_per_m2
0 1 1418 794 233616 294.226700
1 2 1474 1726 525410 304.409038
2 3 1335 5776 1612292 279.136427
3 4 1224 6494 1681651 258.954573
4 5 1320 2085 576313 276.409113
EDA of installed plant dataset.
In [482… sns.pairplot(df)
Out[482]: <seaborn.axisgrid.PairGrid at 0x21012d5f6d0>
Observation :
1) Energy per m^2 is directly proportional to Sunshine Hours per year.
In [483… # Let's take close look by jointplot.
sns.jointplot(x='Sunshine Hours per year',y='energy_per_m2',data = df)
Out[483]: <seaborn.axisgrid.JointGrid at 0x21012d5fa30>
Observation :
From the straight line we can think about implementing a linear regression model.
In [484… #Let's see how much is it corelating..
#we find corelation and plot it with heatmap.
sns.heatmap(df.corr(),annot = True)
Out[484]: <AxesSubplot:>
Model Implementation:
In [485… # creating Train and Test data:
X = df[[ 'Sunshine Hours per year']]
y = df['energy_per_m2']
In [486… # make a Split in the Datasets and importing Linear Regression model and fiting on
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_sta
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Out[486]: LinearRegression()
Result of Regression
In [487… # intercept is value of C in y = mx + c
print(lm.intercept_)
36.40591577945423
In [488… # coefficent is slop of the line :
lm.coef_
Out[488]: array([0.18182098])
In [489… #Let's check predicted values
predictions = lm.predict(X_test)
predictions
Out[489]: array([258.95479913, 250.04557096, 304.41004491, 279.13692826,
294.22806986, 248.22736112, 260.40936699, 255.13655848])
In [490… #We can check how it varries from actual values by plotting scatter plot.
plt.scatter(y_test,predictions)
Out[490]: <matplotlib.collections.PathCollection at 0x21015f39ee0>
Observation:
From the graph we can see there is almost no deviation between y_test and the predictions.
In [491… from sklearn import metrics
metrics.mean_absolute_error(y_test,predictions)
Out[491]: 0.00046171013969242836
In [492… from sklearn.metrics import r2_score
r2_score(y_test, predictions)
Out[492]: 0.9999999989429613
From the results we can see that there is almost zero error in our model and the R2 score is also nearly 1, which is the best possible outcome.
Now let's explore the second dataset.
In [493… df_1.head()
Out[493]: Date Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices
0 2018-01-01 0.22 0.27 45.36 2.560 304.0
1 2018-01-01 0.28 0.23 34.02 3.216 318.0
2 2018-01-01 0.28 0.27 45.36 3.144 102.0
3 2018-01-01 0.35 0.20 30.78 3.664 248.0
4 2018-01-01 0.18 0.30 33.21 2.632 326.0
In [494… df_1.describe()
Out[494]: Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices
count 3304.000000 3304.000000 3302.000000 3303.000000 3303.000000
mean 0.448610 0.495627 99.031788 3.776776 146.799273
std 0.266589 0.255900 51.760654 17.347552 78.796555
min 0.014000 0.002000 20.160000 2.400000 57.000000
25% 0.210000 0.250000 48.640000 3.042000 92.000000
50% 0.350000 0.514000 99.560000 3.474000 126.000000
75% 0.720000 0.729000 140.800000 3.897000 159.000000
max 0.873000 0.929000 1000.000000 1000.000000 330.000000
In [495… df_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3304 entries, 0 to 3303
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Date             3304 non-null   datetime64[ns]
 1   Longitude        3304 non-null   float64
 2   Latitude         3304 non-null   float64
 3   Sunhine Hours    3302 non-null   float64
 4   Avg. Wind Speed  3303 non-null   float64
 5   Property prices  3303 non-null   float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 155.0 KB
In [496… #EDA of dataset
sns.pairplot(df_1)
Out[496]: <seaborn.axisgrid.PairGrid at 0x21015dacac0>
In [497… df_1.groupby(['Date']).count()
Out[497]: Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices
Date
2018-01-01 59 59 59 59 59
2018-02-01 59 59 59 59 59
2018-03-01 59 59 58 59 59
2018-04-01 59 59 59 59 59
2018-05-01 59 59 59 59 59
2018-06-01 59 59 59 59 59
2018-07-01 59 59 59 59 59
2018-08-01 59 59 59 59 59
2018-09-01 59 59 59 59 59
2018-10-01 59 59 58 59 59
2018-11-01 59 59 59 59 59
2018-12-01 59 59 59 59 59
2019-01-01 59 59 59 59 59
2019-02-01 59 59 59 59 59
2019-03-01 59 59 59 59 59
2019-04-01 59 59 59 59 59
2019-05-01 59 59 59 58 59
2019-06-01 59 59 59 59 59
2019-07-01 59 59 59 59 59
2019-08-01 59 59 59 59 59
2019-09-01 59 59 59 59 59
2019-10-01 59 59 59 59 59
2019-11-01 59 59 59 59 59
2019-12-01 59 59 59 59 59
2020-01-01 59 59 59 59 59
2020-02-01 59 59 59 59 59
2020-03-01 59 59 59 59 59
2020-04-01 59 59 59 59 59
2020-05-01 59 59 59 59 59
2020-06-01 59 59 59 59 59
2020-07-01 59 59 59 59 59
2020-08-01 59 59 59 59 59
2020-09-01 59 59 59 59 59
2020-10-01 59 59 59 59 59
2020-11-01 59 59 59 59 59
2020-12-01 59 59 59 59 59
2021-01-01 59 59 59 59 59
2021-02-01 59 59 59 59 59
2021-03-01 59 59 59 59 59
2021-04-01 59 59 59 59 59
2021-05-01 59 59 59 59 59
2021-06-01 59 59 59 59 59
2021-07-01 59 59 59 59 59
2021-08-01 59 59 59 59 59
2021-09-01 59 59 59 59 59
2021-10-01 59 59 59 59 58
2021-11-01 59 59 59 59 59
2021-12-01 59 59 59 59 59
2022-01-01 59 59 59 59 59
2022-02-01 59 59 59 59 59
2022-03-01 59 59 59 59 59
2022-04-01 59 59 59 59 59
2022-05-01 59 59 59 59 59
2022-06-01 59 59 59 59 59
2022-07-01 59 59 59 59 59
2022-08-01 59 59 59 59 59
We can see from the above two results that data for 59 unique properties is available to us. Also, from the pair plot we can see that the properties are divided into 4 clusters.
The dataset contains data for 56 months, so we have to convert it into a yearly format. Price and wind speed remain the same throughout the time period for each property. We will create a dataset of the 59 properties.
In [498… # finding uniqe properties.
locations_ = df_1[['Longitude','Latitude']].drop_duplicates()
locations_
Out[498]: Longitude Latitude
0 0.220 0.270
1 0.280 0.230
2 0.280 0.270
3 0.350 0.200
4 0.180 0.300
5 0.310 0.320
6 0.300 0.250
7 0.200 0.200
8 0.230 0.250
9 0.210 0.210
10 0.220 0.700
11 0.280 0.680
12 0.280 0.690
13 0.350 0.700
14 0.180 0.800
15 0.310 0.750
16 0.300 0.720
17 0.200 0.770
18 0.230 0.760
19 0.210 0.740
20 0.720 0.700
21 0.640 0.680
22 0.630 0.690
23 0.680 0.700
24 0.770 0.800
25 0.770 0.750
26 0.760 0.720
27 0.740 0.770
28 0.720 0.760
29 0.700 0.740
30 0.720 0.220
31 0.640 0.260
32 0.630 0.280
33 0.680 0.250
34 0.770 0.180
35 0.770 0.310
36 0.760 0.300
37 0.740 0.200
38 0.720 0.230
39 0.700 0.210
40 0.233 0.929
41 0.617 0.514
42 0.373 0.002
43 0.864 0.838
44 0.081 0.805
45 0.124 0.413
46 0.164 0.106
47 0.137 0.710
48 0.064 0.835
49 0.160 0.173
50 0.014 0.729
51 0.025 0.472
52 0.715 0.211
53 0.808 0.505
54 0.873 0.379
55 0.856 0.100
56 0.233 0.623
57 0.567 0.727
58 0.180 0.611
In [499… #Creating Avrage Sunrise hours per month for each property.
df_1.groupby(['Longitude','Latitude'])['Sunhine Hours'].mean().reset_index()
Out[499]: Longitude Latitude Sunhine Hours
0 0.014 0.729 96.410714
1 0.025 0.472 94.302857
2 0.064 0.835 94.963929
3 0.081 0.805 94.248214
4 0.124 0.413 96.712857
5 0.137 0.710 98.401786
6 0.160 0.173 96.977455
7 0.164 0.106 93.440714
8 0.180 0.300 107.601830
9 0.180 0.611 94.173571
10 0.180 0.800 108.632009
11 0.200 0.200 107.670134
12 0.200 0.770 107.224554
13 0.210 0.210 105.145714
14 0.210 0.740 104.628616
15 0.220 0.270 107.102411
16 0.220 0.700 105.796607
17 0.230 0.250 106.368348
18 0.230 0.760 105.567589
19 0.233 0.623 93.926071
20 0.233 0.929 94.735714
21 0.280 0.230 107.218527
22 0.280 0.270 106.851696
23 0.280 0.680 108.749732
24 0.280 0.690 105.909911
25 0.300 0.250 106.069821
26 0.300 0.720 105.526205
27 0.310 0.320 105.487634
28 0.310 0.750 106.812723
29 0.350 0.200 106.869375
30 0.350 0.700 109.638409
31 0.373 0.002 95.600357
32 0.567 0.727 111.343214
33 0.617 0.514 92.911429
34 0.630 0.280 91.532143
35 0.630 0.690 96.505357
36 0.640 0.260 95.721429
37 0.640 0.680 93.820357
38 0.680 0.250 95.400000
39 0.680 0.700 92.683929
40 0.700 0.210 95.352857
41 0.700 0.740 94.743571
42 0.715 0.211 93.179286
43 0.720 0.220 90.614286
44 0.720 0.230 94.259643
45 0.720 0.700 93.306786
46 0.720 0.760 95.486429
47 0.740 0.200 92.203929
48 0.740 0.770 96.747500
49 0.760 0.300 93.957143
50 0.760 0.720 96.785357
51 0.770 0.180 92.680357
52 0.770 0.310 97.258929
53 0.770 0.750 95.524643
54 0.770 0.800 96.507143
55 0.808 0.505 92.385714
56 0.856 0.100 95.538929
57 0.864 0.838 93.215357
58 0.873 0.379 94.596429
In [500… df_1.groupby(['Longitude','Latitude'])['Avg. Wind Speed'].mean().reset_index()
Out[500]: Longitude Latitude Avg. Wind Speed
0 0.014 0.729 3.680839
1 0.025 0.472 3.540536
2 0.064 0.835 3.565125
3 0.081 0.805 3.615589
4 0.124 0.413 3.676500
5 0.137 0.710 3.593732
6 0.160 0.173 3.511929
7 0.164 0.106 3.597750
8 0.180 0.300 3.360143
9 0.180 0.611 3.478821
10 0.180 0.800 3.196000
11 0.200 0.200 3.168286
12 0.200 0.770 3.327143
13 0.210 0.210 3.179000
14 0.210 0.740 3.243714
15 0.220 0.270 3.214571
16 0.220 0.700 3.245143
17 0.230 0.250 3.097286
18 0.230 0.760 3.140571
19 0.233 0.623 3.607393
20 0.233 0.929 3.706714
21 0.280 0.230 3.232714
22 0.280 0.270 3.204429
23 0.280 0.680 3.212714
24 0.280 0.690 3.137429
25 0.300 0.250 3.321429
26 0.300 0.720 20.975714
27 0.310 0.320 3.174714
28 0.310 0.750 3.206571
29 0.350 0.200 3.235286
30 0.350 0.700 3.258857
31 0.373 0.002 3.812143
32 0.567 0.727 3.604982
33 0.617 0.514 3.448929
34 0.630 0.280 3.697875
35 0.630 0.690 3.590196
36 0.640 0.260 3.705429
37 0.640 0.680 3.495375
38 0.680 0.250 3.606429
39 0.680 0.700 3.578625
40 0.700 0.210 3.546321
41 0.700 0.740 3.635679
42 0.715 0.211 3.567857
43 0.720 0.220 3.610607
44 0.720 0.230 3.617357
45 0.720 0.700 3.485571
46 0.720 0.760 3.708321
47 0.740 0.200 3.597911
48 0.740 0.770 3.554196
49 0.760 0.300 3.630857
50 0.760 0.720 3.608679
51 0.770 0.180 3.552107
52 0.770 0.310 3.673768
53 0.770 0.750 3.644357
54 0.770 0.800 3.588218
55 0.808 0.505 3.583125
56 0.856 0.100 3.668304
57 0.864 0.838 3.624429
58 0.873 0.379 3.682125
In [501… df_1.groupby(['Longitude','Latitude'])['Property prices'].mean().reset_index()
Out[501]: Longitude Latitude Property prices
0 0.014 0.729 67.0
1 0.025 0.472 116.0
2 0.064 0.835 91.0
3 0.081 0.805 65.0
4 0.124 0.413 126.0
5 0.137 0.710 86.0
6 0.160 0.173 130.0
7 0.164 0.106 95.0
8 0.180 0.300 326.0
9 0.180 0.611 127.0
10 0.180 0.800 276.0
11 0.200 0.200 105.0
12 0.200 0.770 224.0
13 0.210 0.210 273.0
14 0.210 0.740 312.0
15 0.220 0.270 304.0
16 0.220 0.700 174.0
17 0.230 0.250 159.0
18 0.230 0.760 137.0
19 0.233 0.623 73.0
20 0.233 0.929 128.0
21 0.280 0.230 318.0
22 0.280 0.270 102.0
23 0.280 0.680 131.0
24 0.280 0.690 330.0
25 0.300 0.250 129.0
26 0.300 0.720 245.0
27 0.310 0.320 232.0
28 0.310 0.750 320.0
29 0.350 0.200 248.0
30 0.350 0.700 277.0
31 0.373 0.002 139.0
32 0.567 0.727 149.0
33 0.617 0.514 137.0
34 0.630 0.280 61.0
35 0.630 0.690 150.0
36 0.640 0.260 117.0
37 0.640 0.680 74.0
38 0.680 0.250 134.0
39 0.680 0.700 93.0
40 0.700 0.210 93.0
41 0.700 0.740 149.0
42 0.715 0.211 107.0
43 0.720 0.220 82.0
44 0.720 0.230 98.0
45 0.720 0.700 73.0
46 0.720 0.760 90.0
47 0.740 0.200 72.0
48 0.740 0.770 87.0
49 0.760 0.300 96.0
50 0.760 0.720 137.0
51 0.770 0.180 57.0
52 0.770 0.310 139.0
53 0.770 0.750 108.0
54 0.770 0.800 92.0
55 0.808 0.505 99.0
56 0.856 0.100 112.0
57 0.864 0.838 74.0
58 0.873 0.379 115.0
now Merging all columns together:
In [502… locations_ = locations_.sort_values(by=['Longitude', 'Latitude'],)
In [503… locations_['Avg Sunshine hours'] = df_1.groupby(['Longitude','Latitude'])['Sunhine
In [504… locations_['Avg Wind speed'] = df_1.groupby(['Longitude','Latitude'])['Avg. Wind S
In [505… locations_['Avg Price'] = df_1.groupby(['Longitude','Latitude'])['Property prices
In [506… locations_.head()
Out[506]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price
50 0.014 0.729 96.785357 3.608679 137.0
51 0.025 0.472 92.680357 3.552107 57.0
48 0.064 0.835 96.747500 3.554196 87.0
44 0.081 0.805 94.259643 3.617357 98.0
45 0.124 0.413 93.306786 3.485571 73.0
Now we have a dataset of 59 properties, but we do not know in which region / area they are.
In [507… # we plot the longatude vs latitude so we can see the properties.
Reg = np.array(locations_[['Longitude','Latitude']])
plt.scatter(Reg[:,0],Reg[:,1])
Out[507]: <matplotlib.collections.PathCollection at 0x2101ea7df10>
We can see there are 4 regions. We do not have any labels available, so we have to use unsupervised learning to predict the clusters.
We will use the k-means algorithm for clustering.
In [508… from sklearn.cluster import KMeans
Kmeans = KMeans(n_clusters = 4)
Kmeans.fit(Reg)
Out[508]: KMeans(n_clusters=4)
In [509… Reg_list = Kmeans.labels_
Reg_list
Out[509]: array([3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 0,
0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2,
2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2])
In [510… #from these we can define which region are which.
Kmeans.cluster_centers_
Out[510]: array([[0.2415 , 0.22814286],
[0.71328571, 0.70671429],
[0.73646154, 0.24076923],
[0.19594444, 0.72355556]])
In [511… fig,(ax1) = plt.subplots(1, sharey=True, figsize = (5,5))
ax1.set_title('Saperate Regions using Kmeans')
ax1.scatter(Reg[:,0],Reg[:,1],c= Kmeans.labels_,cmap='rainbow')
Out[511]: <matplotlib.collections.PathCollection at 0x2101eacacd0>
In [512… locations_['region'] = Reg_list
In [513… locations_.head()
Out[513]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region
50 0.014 0.729 96.785357 3.608679 137.0 3
51 0.025 0.472 92.680357 3.552107 57.0 3
48 0.064 0.835 96.747500 3.554196 87.0 3
44 0.081 0.805 94.259643 3.617357 98.0 3
45 0.124 0.413 93.306786 3.485571 73.0 0
In [514… #defining Region using Hot-Encoding.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(locations_[['region']]).toarray())
locations_1 = locations_.join(encoder_df)
locations_1.columns = ['Longitude','Latitude','Avg Sunhine Hours','Avg. Wind Speed','Avg prices','region','NE','NW','SW','SE']
locations_1.head()
Out[514]: Longitude  Latitude  Avg Sunhine Hours  Avg. Wind Speed  Avg prices  region  NE  NW  SW  SE
50 0.014 0.729 96.785357 3.608679 137.0 3 0.0 1.0 0.0 0.0
51 0.025 0.472 92.680357 3.552107 57.0 3 0.0 0.0 1.0 0.0
48 0.064 0.835 96.747500 3.554196 87.0 3 0.0 1.0 0.0 0.0
44 0.081 0.805 94.259643 3.617357 98.0 3 0.0 0.0 1.0 0.0
45 0.124 0.413 93.306786 3.485571 73.0 0 0.0 1.0 0.0 0.0
In [515… #defining Region by lamda function:
locations_['Area']= locations_['region'].apply(lambda region:"South-West" if region
In [516… locations_.head()
Out[516]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Area
50 0.014 0.729 96.785357 3.608679 137.0 3 South-West
51 0.025 0.472 92.680357 3.552107 57.0 3 South-West
48 0.064 0.835 96.747500 3.554196 87.0 3 South-West
44 0.081 0.805 94.259643 3.617357 98.0 3 South-West
45 0.124 0.413 93.306786 3.485571 73.0 0 South-East
We want sunshine hours on a yearly basis; that is why we will multiply by 12.
In [517… locations_['Avg Sunshine hours'] = locations_['Avg Sunshine hours']*12
Now predicting the energy per m^2 using the coefficient and intercept of the linear regression
In [518… locations_['Pridected Energy'] = lm.coef_[0]*locations_['Avg Sunshine hours'] + lm.intercept_
locations_.head()
Out[518]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
50  0.014  0.729  1161.424286  3.608679  137.0  3  South-West  247.577221
51  0.025  0.472  1112.164286  3.552107   57.0  3  South-West  238.620720
48  0.064  0.835  1160.970000  3.554196   87.0  3  South-West  247.494623
44  0.081  0.805  1131.115714  3.617357   98.0  3  South-West  242.066487
45  0.124  0.413  1119.681429  3.485571   73.0  0  South-East  239.987494
Now we will group data by Region and do the analysis.
In [519… sns.countplot(x='Area',data = locations_)
Out[519]: <AxesSubplot:xlabel='Area', ylabel='count'>
In [520… locations_.groupby(['Area']).count()
Out[520]:            Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Pridected Energy
Area
North-East  13  13  13  13  13  13  13
North-West  14  14  14  14  14  14  14
South-East  14  14  14  14  14  14  14
South-West  18  18  18  18  18  18  18
In [521… sns.barplot(x = 'Area', y = 'Pridected Energy',data = locations_, estimator = max)
Out[521]: <AxesSubplot:xlabel='Area', ylabel='Pridected Energy'>
In [522… locations_.groupby(['Area'], sort=False)['Pridected Energy'].max()
Out[522]: Area
South-West 273.424860
South-East 271.177163
North-West 273.681714
North-East 279.340308
Name: Pridected Energy, dtype: float64
In [523… sns.barplot(x = 'Area', y = 'Avg Price',data = locations_, estimator = max)
Out[523]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
In [524… sns.barplot(x = 'Area', y = 'Avg Price',data = locations_, estimator = min)
Out[524]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
In [525… locations_.groupby(['Area'], sort=False)['Avg Price'].min()
Out[525]: Area
South-West 57.0
South-East 65.0
North-West 74.0
North-East 61.0
Name: Avg Price, dtype: float64
In [526… NW_ = locations_[locations_['Area'] == 'North-West']
NE_ = locations_[locations_['Area'] == 'North-East']
SE_ = locations_[locations_['Area'] == 'South-East']
SW_ = locations_[locations_['Area'] == 'South-West']
For scenario 1 we will see how we can optimize cost
and energy
In [527… NW_
Out[527]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
57  0.567  0.727  1118.584286   3.624429   74.0  1  North-West  239.788010
41  0.617  0.514  1136.922857   3.635679  149.0  1  North-West  243.122347
22  0.630  0.690  1282.220357   3.204429  102.0  1  North-West  269.540482
21  0.640  0.680  1286.622321   3.232714  318.0  1  North-West  270.340851
23  0.680  0.700  1304.996786   3.212714  131.0  1  North-West  273.681714
29  0.700  0.740  1282.432500   3.235286  248.0  1  North-West  269.579054
20  0.720  0.700  1136.828571   3.706714  128.0  1  North-West  243.105204
28  0.720  0.760  1281.752679   3.206571  320.0  1  North-West  269.455448
27  0.740  0.770  1265.851607   3.174714  232.0  1  North-West  266.564299
26  0.760  0.720  1266.314464  20.975714  245.0  1  North-West  266.648457
25  0.770  0.750  1272.837857   3.321429  129.0  1  North-West  267.834546
24  0.770  0.800  1270.918929   3.137429  330.0  1  North-West  267.485645
53  0.808  0.505  1146.295714   3.644357  108.0  1  North-West  244.826530
43  0.864  0.838  1087.371429   3.610607   82.0  1  North-West  234.112858
In [528… # we wiil sort value of colums price and energy and choose the row where price is
NW_sorted = NW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,
best_row_nw = NW_sorted.head(1)
best_row_nw
Out[528]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
57  0.567  0.727  1118.584286  3.624429  74.0  1  North-West  239.78801
In [529… SE_
Out[529]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
45  0.124  0.413  1119.681429  3.485571   73.0  0  South-East  239.987494
49  0.160  0.173  1127.485714  3.630857   96.0  0  South-East  241.406477
46  0.164  0.106  1145.837143  3.708321   90.0  0  South-East  244.743152
4   0.180  0.300  1160.554286  3.676500  126.0  0  South-East  247.419037
7   0.200  0.200  1121.288571  3.597750   95.0  0  South-East  240.279706
9   0.210  0.210  1130.082857  3.478821  127.0  0  South-East  241.878692
0   0.220  0.270  1156.928571  3.680839   67.0  0  South-East  246.759806
8   0.230  0.250  1291.221964  3.360143  326.0  0  South-East  271.177163
1   0.280  0.230  1131.634286  3.540536  116.0  0  South-East  242.160774
2   0.280  0.270  1139.567143  3.565125   91.0  0  South-East  243.603134
6   0.300  0.250  1163.729455  3.511929  130.0  0  South-East  247.996349
5   0.310  0.320  1180.821429  3.593732   86.0  0  South-East  251.104029
3   0.350  0.200  1130.978571  3.615589   65.0  0  South-East  242.041552
42  0.373  0.002  1118.151429  3.567857  107.0  0  South-East  239.709308
In [530… SE_sorted = SE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,
best_row_se = SE_sorted.head(1)
best_row_se
Out[530]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
3  0.35  0.2  1130.978571  3.615589  65.0  0  South-East  242.041552
In [531… SW_
Out[531]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
50  0.014  0.729  1161.424286  3.608679  137.0  3  South-West  247.577221
51  0.025  0.472  1112.164286  3.552107   57.0  3  South-West  238.620720
48  0.064  0.835  1160.970000  3.554196   87.0  3  South-West  247.494623
44  0.081  0.805  1131.115714  3.617357   98.0  3  South-West  242.066487
47  0.137  0.710  1106.447143  3.597911   72.0  3  South-West  237.581223
58  0.180  0.611  1135.157143  3.682125  115.0  3  South-West  242.801303
14  0.180  0.800  1255.543393  3.243714  312.0  3  South-West  264.690050
17  0.200  0.770  1276.420179  3.097286  159.0  3  South-West  268.485888
19  0.210  0.740  1127.112857  3.607393   73.0  3  South-West  241.338684
10  0.220  0.700  1303.584107  3.196000  276.0  3  South-West  273.424860
18  0.230  0.760  1266.811071  3.140571  137.0  3  South-West  266.738750
56  0.233  0.623  1146.467143  3.668304  112.0  3  South-West  244.857699
40  0.233  0.929  1144.234286  3.546321   93.0  3  South-West  244.451719
11  0.280  0.680  1292.041607  3.168286  105.0  3  South-West  271.326191
12  0.280  0.690  1286.694643  3.327143  224.0  3  South-West  270.354001
16  0.300  0.720  1269.559286  3.245143  174.0  3  South-West  267.238433
15  0.310  0.750  1285.228929  3.214571  304.0  3  South-West  270.087503
13  0.350  0.700  1261.748571  3.179000  273.0  3  South-West  265.818281
In [532… SW_sorted = SW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,
best_row_sw = SW_sorted.head(1)
best_row_sw
Out[532]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
51  0.025  0.472  1112.164286  3.552107  57.0  3  South-West  238.62072
In [533… NE_
Out[533]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
32  0.630  0.280  1336.118571  3.604982  149.0  2  North-East  279.340308
31  0.640  0.260  1147.204286  3.812143  139.0  2  North-East  244.991727
33  0.680  0.250  1114.937143  3.448929  137.0  2  North-East  239.124883
39  0.700  0.210  1112.207143  3.578625   93.0  2  North-East  238.628512
52  0.715  0.211  1167.107143  3.673768  139.0  2  North-East  248.610484
30  0.720  0.220  1315.660909  3.258857  277.0  2  North-East  275.620676
38  0.720  0.230  1144.800000  3.606429  134.0  2  North-East  244.554577
37  0.740  0.200  1125.844286  3.495375   74.0  2  North-East  241.108031
36  0.760  0.300  1148.657143  3.705429  117.0  2  North-East  245.255887
34  0.770  0.180  1098.385714  3.697875   61.0  2  North-East  236.115486
35  0.770  0.310  1158.064286  3.590196  150.0  2  North-East  246.966303
55  0.856  0.100  1108.628571  3.583125   99.0  2  North-East  237.977853
54  0.873  0.379  1158.085714  3.588218   92.0  2  North-East  246.970199
In [534… NE_sorted = NE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,
best_row_ne = NE_sorted.head(1)
best_row_ne
Out[534]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
34  0.77  0.18  1098.385714  3.697875  61.0  2  North-East  236.115486
In [535… Scenari0_1 = pd.concat([best_row_nw, best_row_se, best_row_sw, best_row_ne], axis=0)
In [536… Scenari0_1
Out[536]:  Longitude  Latitude  Avg Sunshine hours  Avg Wind speed  Avg Price  region  Area  Pridected Energy
57  0.567  0.727  1118.584286  3.624429  74.0  1  North-West  239.788010
3   0.350  0.200  1130.978571  3.615589  65.0  0  South-East  242.041552
51  0.025  0.472  1112.164286  3.552107  57.0  3  South-West  238.620720
34  0.770  0.180  1098.385714  3.697875  61.0  2  North-East  236.115486
In [537… # Calculate Energy
Total_Energy = (238.620720 * 3000) + (236.115486 * 2000) + (242.041552 * 2000) + (239.788010 * 1500)
In [538… Total_Energy = round(Total_Energy)
Total_Energy
Out[538]: 2031858
In [539… # Cost
Total_cost = round((3000 * 157) + (2000 * 161) + (2000 * 165) + (1500 * 174))
Total_cost
Out[539]: 1384000
In [540… print(" Total Energy genrated in KWH/a:",Total_Energy)
print(" Total cost in Euro:",Total_cost)
print(" total area we occupying : 8500 m^2")
Total Energy genrated in KWH/a: 2031858
Total cost in Euro: 1384000
total area we occupying : 8500 m^2
Scenario 2
We can also use the above method to solve scenario 2, but we will try it using hierarchical clustering:
In [541… from sklearn.cluster import AgglomerativeClustering
# Extract the two columns of features that you want to use for clustering
NW_copmare = NW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_NW.fit(NW_copmare)
# Predict the clusters for each data point
pred = cluster_NW.fit_predict(NW_copmare)
# Create a scatter plot of the clusters
plt.scatter(NW_copmare['Avg Price'], NW_copmare['Pridected Energy'], c=pred, cmap=
plt.show()
In [542… pred
Out[542]: array([3, 0, 1, 2, 1, 4, 0, 2, 4, 4, 1, 2, 0, 3], dtype=int64)
In [543… SE_copmare = SE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_SE.fit(SE_copmare)
# Predict the clusters for each data point
pred = cluster_SE.fit_predict(SE_copmare)
# Create a scatter plot of the clusters
plt.scatter(SE_copmare['Avg Price'], SE_copmare['Pridected Energy'], c=pred, cmap=
plt.show()
In [544]: SW_copmare = SW_[['Avg Price','Pridected Energy']]
          # Create an instance of the AgglomerativeClustering class
          cluster_SW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
          # Fit the model to the data
          cluster_SW.fit(SW_copmare)
          # Predict the cluster for each data point
          pred = cluster_SW.fit_predict(SW_copmare)
          # Create a scatter plot of the clusters
          plt.scatter(SW_copmare['Avg Price'], SW_copmare['Pridected Energy'], c=pred,
                      cmap='rainbow')   # colour map value truncated in the export; 'rainbow' assumed
          plt.show()
In [545]: NE_copmare = NE_[['Avg Price','Pridected Energy']]
          # Create an instance of the AgglomerativeClustering class
          cluster_NE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
          # Fit the model to the data
          cluster_NE.fit(NE_copmare)
          # Predict the cluster for each data point
          pred = cluster_NE.fit_predict(NE_copmare)
          # Create a scatter plot of the clusters
          plt.scatter(NE_copmare['Avg Price'], NE_copmare['Pridected Energy'], c=pred,
                      cmap='rainbow')   # colour map value truncated in the export; 'rainbow' assumed
          plt.show()
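The four cells above repeat the same steps for each region. Purely as a tidier alternative, and not part of the original notebook, the clustering and plotting can be driven by one loop. The sketch below assumes the region DataFrames and column names used above; the colour map and axis labels are arbitrary choices.

# A minimal sketch: run the same agglomerative clustering for every region in one
# loop. Assumes the region DataFrames NW_, SE_, SW_, NE_ and the columns
# 'Avg Price' and 'Pridected Energy' created earlier in the notebook.
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

regions = {'North-West': NW_, 'South-East': SE_, 'South-West': SW_, 'North-East': NE_}

for name, frame in regions.items():
    features = frame[['Avg Price', 'Pridected Energy']]
    model = AgglomerativeClustering(n_clusters=5, linkage='ward')
    labels = model.fit_predict(features)          # cluster label for every location
    plt.scatter(features['Avg Price'], features['Pridected Energy'], c=labels, cmap='rainbow')
    plt.title(name + ': price vs. predicted energy')
    plt.xlabel('Avg Price (EUR/m^2)')
    plt.ylabel('Pridected Energy (kWh/a per m^2)')
    plt.show()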
First, for each region, we use the optimal point that combines high energy with a cheap price:
In [546]: # SW + NW + NE + SE
          Energy_op = (252 * 2000) + (272 * 3000) + (273 * 3000) + (280 * 2000)
          Energy_op

Out[546]: 2699000
Now we use, in each region, the point with the highest energy:
In [547]: Energy_hi = (271 * 2000) + (273.5 * 3000) + (273 * 3000) + (280 * 2000)
          Energy_hi

Out[547]: 2741500.0

In [548]: cost = (376 * 3000) + (249 * 2000) + (231 * 3000) + (426 * 2000)
          cost

Out[548]: 3171000
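Putting these results next to the Scenario 2 targets from the problem statement (3 million kWh/a demand, 3 million € budget) makes the shortfall explicit. The per-m^2 figures in the cost cell appear to be land price plus the 100 €/m^2 material cost (for example 276 + 100 = 376). The small check below is a sketch added here; it simply reuses the totals printed above.

# Feasibility check for Scenario 2, using the totals computed above and the
# targets from the problem statement.
demand_kwh = 3_000_000
budget_eur = 3_000_000

energy_optimal = 2_699_000   # cheaper "optimal" points per region (Out[546])
energy_highest = 2_741_500   # highest-energy points per region (Out[547])
cost_highest   = 3_171_000   # cost of the highest-energy points (Out[548])

print("Optimal points meet demand:       ", energy_optimal >= demand_kwh)   # False
print("Highest-energy points meet demand:", energy_highest >= demand_kwh)   # False
print("Energy shortfall (kWh/a):", demand_kwh - energy_highest)             # 258500
print("Budget overrun (EUR):    ", cost_highest - budget_eur)               # 171000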
Conclusion
Scenario 1
We can easily fulfil the demand of 2 million kWh/a, and we do not even need the full 2 million € budget.
We also do not use all of the proposed area: at "North-West", the most expensive of the four selected locations, we build on only half of the land, which saves money.
Scenario 2
First we used the points with the second-highest energy, which are cheaper than the highest-energy points, but they do not fulfil the energy demand.
Then we took the points with the highest energy, without considering cost; even so, the demand is not fulfilled. We would need more land and somewhat more budget, so Scenario 2 is not feasible within the given limits.
Bibliography
Burkov, A. (2020). The Hundred-Page Machine Learning Book.
More Related Content

Similar to AI Final report 1.pdf

CS301-lec01.ppt
CS301-lec01.pptCS301-lec01.ppt
CS301-lec01.pptomair31
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
Semantic Web mining using nature inspired optimization methods
Semantic Web mining using nature inspired optimization methodsSemantic Web mining using nature inspired optimization methods
Semantic Web mining using nature inspired optimization methodslucianb
 
5 parallel implementation 06299286
5 parallel implementation 062992865 parallel implementation 06299286
5 parallel implementation 06299286Ninad Samel
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Clustering in Machine Learning.pdf
Clustering in Machine Learning.pdfClustering in Machine Learning.pdf
Clustering in Machine Learning.pdfSudhanshiBakre1
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmIJERA Editor
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsPhilip Schwarz
 
Citython presentation
Citython presentationCitython presentation
Citython presentationAnkit Tewari
 
Spatial Approximate String Keyword content Query processing
Spatial Approximate String Keyword content Query processingSpatial Approximate String Keyword content Query processing
Spatial Approximate String Keyword content Query processinginventionjournals
 
Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...eSAT Journals
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...IJDKP
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsinfopapers
 
Introduction to r studio on aws 2020 05_06
Introduction to r studio on aws 2020 05_06Introduction to r studio on aws 2020 05_06
Introduction to r studio on aws 2020 05_06Barry DeCicco
 

Similar to AI Final report 1.pdf (20)

CS301-lec01.ppt
CS301-lec01.pptCS301-lec01.ppt
CS301-lec01.ppt
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
Semantic Web mining using nature inspired optimization methods
Semantic Web mining using nature inspired optimization methodsSemantic Web mining using nature inspired optimization methods
Semantic Web mining using nature inspired optimization methods
 
5 parallel implementation 06299286
5 parallel implementation 062992865 parallel implementation 06299286
5 parallel implementation 06299286
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Algorithms 14-00122
Algorithms 14-00122Algorithms 14-00122
Algorithms 14-00122
 
Clustering in Machine Learning.pdf
Clustering in Machine Learning.pdfClustering in Machine Learning.pdf
Clustering in Machine Learning.pdf
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
Data Structure Lec #1
Data Structure Lec #1Data Structure Lec #1
Data Structure Lec #1
 
Numerical data.
Numerical data.Numerical data.
Numerical data.
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with Views
 
Deep learning (2)
Deep learning (2)Deep learning (2)
Deep learning (2)
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
 
Spatial Approximate String Keyword content Query processing
Spatial Approximate String Keyword content Query processingSpatial Approximate String Keyword content Query processing
Spatial Approximate String Keyword content Query processing
 
Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernels
 
Lect4
Lect4Lect4
Lect4
 
Introduction to r studio on aws 2020 05_06
Introduction to r studio on aws 2020 05_06Introduction to r studio on aws 2020 05_06
Introduction to r studio on aws 2020 05_06
 
Machine Learning... a piece of cake!
Machine Learning... a piece of cake!Machine Learning... a piece of cake!
Machine Learning... a piece of cake!
 

Recently uploaded

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

AI Final report 1.pdf

  • 1. 1 Project Report on PREDICTION OF BEST LOCATION FOR SOLAR FARM IN ORDER TO MEET ENERGY DEMAND AND COMPANY PROFIT. Submi ed by SHRUTEJ JARIWALA PARSHWA BHAVSAR VIRAL SUREJA VISHNUVARDHAN CHOWDARY SUBJECT – AI BASICS PROF: JEAN-MICHEL TAVERNE
  • 2. 2 1. Problem Summery Finding location for solar farm, which can fulfil customer need of energy demand: 1) 2 million kWh/a. Dataset 2) 3 million kWh/a Limitations: The building space in the regions is limited so may be built in the regions max following area: -North-West: 3,000m2 -North-East: 3,000m2 -South-West: 2,000 m2 -South-East: 2.000m2 For one square meter of solar plant Smart Energy LLC has to pay 100€ for the material plus the cost of the land. -2 million € for scenario -1 and -In order to fulfil scenario 2, a budget of 3 million € can be invested. Objective We have to find best solar farm location which can fulfil the energy need of the consumer along with Company profit. For that we want to apply machine learning techniques in order to find solution of this problem. 1.1. What is Machine Learning? Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm. Machine learning can also be defined as the process of solving a practical problem by 1) gathering a dataset, 2) algorithmically building a statistical model based on that dataset. That statistical model is assumed to be used somehow to solve the practical problem. To save keystrokes, I use the terms “learning” and “machine learning” interchangeably. (Burkov, 2020) Types of learning can be supervised, semi-supervised, unsupervised and reinforcement. 1.2. Supervised Learning In supervised learning1, the dataset is the collection of labeled examples {(xi, yi)}N i=1.Each element xi among N is called a feature vector. A feature vector is a vector in which each dimension j = 1, . . ., D contains a value that describes the example somehow. That value is called a feature and is denoted as x(j). For instance, if each example x in our collection represents a person, then the first feature, x(1), could contain height in cm, the second feature, x(2), could contain weight in kg, x(3) could contain
  • 3. 3 gender, and so on. For all examples in the dataset, the feature at position j in the feature vector always contains the same kind of information. It means that if x(2) i contains weight in kg in some example xi,then x(2) k will also contain weight in kg in every example x k, k = 1, . . . , N . The label yi can be either an element belonging to a finite set of classes {1, 2, . . ., C}, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, is either one of a finite set of classes or a real number2. You can see a class as a category to which an example belongs. For instance, if your examples are email messages and your problem is spam detection, then you have two classes {spam, not spam}. The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector x as input and outputs information that allows deducing the label for this feature vector. For instance, the model created using the dataset of people could take as input a feature vector describing a person and output a probability that the person has cancer. 1.3. Unsupervised Learning In unsupervised learning, the dataset is a collection of unlabelled examples {xi}N i=1. Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector x as input and either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset. In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input x; in outlier detection, the output is a real number that indicates how x is different from a “typical” example in the dataset. 1.4 Reinforcement Learning Reinforcement learning is a subfield of machine learning where the machine “lives” in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. 2. Datasets:  Installed Solar plants  New locations data sets. We have two datasets; first one “installed Solar plants” has data of 20 of already installed power plant’s data, which gives insight on every plant’s sunshine hours per year, solar panel m^2 and value of generated energy in kwh/a. Second data set has data of 56 months start from January 2018 to August 22 for 59 new unique location, which also gives sunrise hours, price per m^2, average wind speed.
  • 4. 4 3. Requirements:  Python  IDE: Jupiter notebook  Libraries: Pandas, NumPy, matplotlib, seaborn , scikit-learn 3.1. Libraries:  import pandas as pd - pandas is a popular Python-based data analysis toolkit which can be imported using import pandas as pd. It presents a diverse range of utilities, ranging from parsing multiple file formats to converting an entire data table into a NumPy matrix array. This makes pandas a trusted ally in data science and machine learning. Similar to NumPy, pandas deal primarily with data in 1-D and 2-D arrays; however, pandas handle the two differently  import matplotlib. pyplot as plt - matplotlib. pyplot is stateful, in that it keeps track of the current figure and plotting area, and the plotting functions are directed to the current axes and can be imported using import matplotlib. pyplot as plt.  import seaborn as sns - Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas’ data structures. Seaborn helps you explore and understand your data. Its plotting functions operate on data frames and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset- oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them  import NumPy as np - NumPy provides a large set of numeric datatypes that you can use to construct arrays. NumPy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. 4. Initial thoughts and Observation  We can observe that both datasets have one common column have “sunrise hour”.  For every region we have limited m^2 per area, if we can find out how much energy is generated in one m^2 area in every region we can know the how much energy can be generated.  We also want to find out if there is any corelation between sunshine hour and energy per m^2. 5. Solution process: 6.1 importing data: Import data using pandas library.
  • 5. 5 We divided Generated energy by size of solar panel area m^2,then we find energy for one meter square area. 6.2 Data exploration : installed plant data From the use of seaborn library, we plotted pair plot of the data df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2'] sns.pairplot(df)
  • 6. 6 We try to find out how much sunrise hour per year column is corelated to energy per meter^2 column. Observation: 1) Energy per m^2 is directly propotional to Sunshine Hours per year. 2) Solar panel m^2 is directly propotional to sunshine Hours per year. #Let's see how much is it corelating.. #we find corelation and plot it with heatmap. sns.heatmap(df.corr(),annot = True)
  • 7. 7 Observation: 1) Energy per meter^2 is completely dependent on sunshine hour per year . 2) We can predict energy per meter^2 by sunshine hours per year, So we can apply regression model. 6.3 Classification vs. Regression Classification is a problem of automatically assigning a label to an unlabelled example. Spam detection is a famous example of classification. In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labelled examples as inputs and produces a model that can take an unlabelled example as input and either directly output a label or output a number that can be used by the analyst to deduce the label. An example of such a number is a probability. In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/ “healthy”, “spam”/“not spam”), we talk about binary classification (also called binomial in some sources). Multiclass classification (also
  • 8. 8 called multinomial) is a classification problem with three or more classes. While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies allowing to turn a binary classification learning algorithm into a multiclass one. Regression is a problem of predicting a real-valued label (often called a target) given an unlabelled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression. The regression problem is solved by a regression learning algorithm that takes a collection of labelled examples as inputs and produces a model that can take an unlabelled example as input and output a target. (Burkov, 2020) 6.4 Linear Regression Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example. 6.4.1 Problem Statement We have a collection of labeled examples {(xi , yi)} N i=1, where N is the size of the collection, xi is the D-dimensional feature vector of example i = 1, . . . , N, yi is a real- valued1 target and every feature x (j) i , j = 1, . . . , D, is also a real number. We want to build a model fw,b(x) as a linear combination of features of example : x: fw, b(x) = wx + b, where w is a D-dimensional vector of parameters and b is a real number. The notation fw,b means that the model f is parametrized by two values: w and b. We will use the model to predict the unknown y for a given x like this: y ← fw,b(x). Two models parametrized by two different pairs (w, b) will likely produce two different predictions when applied to the same example. We want to find the optimal values (w∗ , b∗ ). Obviously, the optimal values of parameters define the model that makes the most accurate predictions. You could have noticed that the form of our linear model in eq. 1 is very similar to the form of the SVM model. The only difference is the missing sign operator. The two models are indeed similar. However, the hyperplane in the SVM plays the role of the decision boundary: it’s used to separate two groups of examples from one another. As such, it has to be as far from each group as possible. On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible. You can see why this latter requirement is essential by looking at the illustration in Figure 1. It displays the regression line (in red) for one-dimensional examples (blue dots). We can use this line to predict the value of the target ynew for a new unlabelled input example xnew. If our examples are D-dimensional feature vectors (for D > 1), the only difference with the one-dimensional case is that the regression model is not a line but a plane or a hyperplane (for D > 2 ).
  • 9. 9 Now you see why it’s essential to have the requirement that the regression hyperplane lies as close to the training examples as possible: if the red line in Figure.1 was far from the blue dots, the prediction ynew would have fewer chances to be correct. 6.4.2 Solution To get this latter requirement satisfied, the optimization procedure which we use to find the optimal values for w∗ and b∗ tries to minimize the following expression: In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression (fw,b(xi) − yi)^2 in the above objective is called the loss function. It’s a measure of penalty for misclassification of example I. This particular choice of the loss function is called squared error loss. All model-based learning algorithms have a loss function and what we do to find the best model is we try to minimize the objective known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical
  • 10. 10 risk, for a model, is the average of all penalties obtained by applying the model to the training data. Why is the loss in linear regression a quadratic function? Why couldn’t we get the absolute value of the difference between the true target yi and the predicted value f (xi) and use that as a penalty? We could. Moreover, we also could use a cube instead of a square. we decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute difference between f (xi) and yi makes sense, the cube of the difference too; Sounds easy, doesn’t it? However, do not rush to invent a new learning algorithm. The fact that the binary loss (1 when f (xi) and yi are different and 0 when they are the same) also makes sense, right? If we made different decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a different machine learning algorithm. (Burkov, 2020) Implementing Linear Regression: We took “Sunshine Hours” column as Feature and Energy per meter^2 as Label. Then split the data in 60/40 ratio for creating train and test data. and we imported Liner Regression model form scikit-learn library. Further we test data on remaining 40% data. By predict function we predict energy from test data. And then we compare predicted value to test labels. That is how we find out error function of our model. Algorithm also gives us regression coefficient and intercept.
  • 11. 11
  • 12. 12 Observation: From scatter plot we can observe the straight line. That shows little deviation and great accuracy. Then we check the absolute mean error and R^2 score. 6.5. Now we do analysis of Second Dataset: Location dataset.
  • 13. 13 Observation:  In first row we can see scatter plot of “longitude vs latitude” which gives location of properties. We can also observe four cluster of regions.  when we group data by Date, and count value we find there are 59 unique location and for each location 56 months of data is given.  We have data in monthly manner so we have to convert it in yearly format. Steps: 1) First, we will create datasets for 59 locations. 2) We classify location in region. 3) We have to find average Sunshine hours, average Price and average wind energy of each location for yearly manner. 4) Then we predict energy for each region. Finding average Sunshine Hours: For each location, we add all sun hours values of 56 months and divide by 56 that give average sunshine hours per month. Then we multiply into 12 so we get average value for one year. Predicting Regions: We have only two features [‘Longitude’, ‘Latitude’] and no labels; that is why we choose unsupervised learning for classification. We are preferring K-means clustering algorithm. 9.2 Clustering Clustering is a problem of learning to assign a label to examples by leveraging an unlabelled dataset. Because the dataset is completely unlabelled, deciding on whether the learned model is optimal is much more complicated than in supervised learning.
  • 14. 14 There is a variety of clustering algorithms, and, unfortunately, it’s hard to tell which one is better in quality for your dataset. Usually, the performance of each algorithm depends on the unknown properties of the probability distribution the dataset was drawn from. In this Chapter, I outline the most useful and widely used clustering algorithms. (Burkov, 2020) 9.2.1 K-Means The k-means clustering algorithm works as follows. First, you choose k — the number of clusters. Then you randomly put k feature vectors, called centroids, to the feature space. We then compute the distance from each example x to each centroid c using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (like if we labelled each example with a centroid id as the label). For each centroid, we calculate the average feature vector of the examples labelled with it. These average feature vectors become the new locations of the centroids. We recompute the distance from each example to each centroid, modify the assignment and repeat the procedure until the assignments don’t change after the centroid locations were recomputed. The model is the list of assignments of centroids IDs to the examples. The initial position of centroids influences the final positions, so two runs of k-means can
  • 15. 15 result in two different models. Some variants of k-means compute the initial positions of centroids based on some properties of the dataset. One run of the k-means algorithm is illustrated in Figure 2. The circles in Figure 2 are two-dimensional feature vectors; the squares are moving centroids. Different background colours represent regions in which all points belong to the same cluster. The value of k, the number of clusters, is a hyperparameter that has to be tuned by the data analyst. There are some techniques for selecting k. None of them is proven optimal. Most of those techniques require the analyst to make an “educated guess” by looking at some metrics or by examining cluster assignments visually. 9.2.3 Determining the Number of Clusters The most important question is how many clusters does your dataset have? When the feature vectors are one-, two- or three-dimensional, you can look at the data and see “clouds” of points in the feature space. Each cloud is a potential cluster. However, for D-dimensional data, with D > 3, looking at the data is problematic. One way of determining the reasonable number of clusters is based on the concept of prediction strength. The idea is to split the data into training and test set, similarly to how we do in supervised learning. Once you have the training and test sets, Str of size Ntr and Ste of size N respectively, you fix k, the number of clusters, and run a clustering algorithm C on sets Str and Ste and obtain the clustering results C (Str, k) and C (Ste, k). Let A be the clustering C (Str, k) built using the training set. The clusters in A can be seen as regions. If an example falls within one of those regions, then that example belongs to some specific cluster. For example, if we apply the k-means algorithm to some dataset, it results in a partition of the feature space into k polygonal regions, as we saw in Figure 2. Define the N× N co-membership matrix D[A, Ste] as follows: D[A, Ste](i,i′ ) = 1 if and only if examples xi and xi′ from the test set belong to the same cluster according to the clustering A. Otherwise D[A, Ste](i,i′) = 0. Let’s take a break and see what we have here. We have built, using the training set of examples, a clustering A that has k clusters. Then we have built the co-membership matrix that indicates whether two examples from the test set belong to the same cluster in A. Intuitively, if the quantity k is the reasonable number of clusters, then two examples that belong to the same cluster in clustering C (Ste, k) will most likely belong to the same cluster
  • 16. 16 in clustering C (Str, k). On the other hand, if k is not reasonable (too high or too low), then training data-based and test data-based clustering will likely be less consistent. Another effective method to estimate the number of clusters is the gap statistic method. Other, less automatic methods, which some analysts still use, include the elbow method and the average silhouette method. Experiments suggest that a reasonable number of clusters is the largest k such that ps(k) is above 0.8. You can see in Figure 5 examples of predictive strength for different values of k for two, three- and four-cluster data. For non-deterministic clustering algorithms, such as k-means, which can generate different clustering depending on the initial positions of centroids, it is recommended to do multiple runs of the clustering algorithm for the same k and compute the average prediction strength ̄ps(k) over multiple runs. Implementing k-means clustering.
  • 17. 17 We observe Region column as categorical column and we can try to convert it into binary data by Feature Engineering Feature Engineering When a product manager tells you “We need to be able to predict whether a particular customer will stay with us. Here are the logs of customers’ interactions with our product for five years.” you cannot just grab the data, load it into a library and get a prediction. You need to build a dataset first.
  • 18. 18 Remember from the first chapter that the dataset is the collection of labeled examples {(xi, yi)} Ni=1. Each element xi among N is called a feature vector. A feature vector is a vector in which each dimension j = 1, . . ., D contains a value that describes the example somehow. That value is called a feature and is denoted as x(j). The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labour-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge. For example, to transform the logs of user interaction with a computer system, one could create features that contain information about the user and various statistics extracted from the logs. For each user, one feature would contain the price of the subscription; other features would contain the frequency of connections per day, week and year. Another feature would contain the average session duration in seconds or the average response time for one request, and so on. Everything measurable can be used as a feature. The role of the data analyst is to create informative features: those would allow the learning algorithm to build a model that predicts well labels of the data used for training. Highly informative features are also called features with high predictive power. For example, the average duration of a user’s session has high predictive power for the problem of predicting whether the user will keep using the application in the future. We say that a model has a low bias when it predicts the training data well. That is, the model makes few mistakes when we use it to predict labels of the examples used to build the model. 5.1.1 One-Hot Encoding Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones. If your example has a categorical feature “colors” and this feature has three possible values: “red,” “yellow,” “green,” you can transform this feature into a vector of three numerical values: red = [1, 0, 0] yellow = [0, 1, 0] green = [0, 0, 1] By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality because that would imply that there’s an order among the values in this category and this specific order is important for the decision making. If the order of a feature’s values is not important, using ordered numbers as values is likely to confuse the learning algorithm,1 because the algorithm will try to find a regularity where there’s no one, which may potentially lead to overfitting. (Burkov, 2020)
  • 19. 19 Implementing hot-encoding : We can also assign region also by Lambda function using if else syntax.
  • 20. 20 Data analysis by regions: 1) Locations counts by region:
  • 21. 21  South-west has highest count of location. 2.) How much each region generating : :  North-east has highest energy of location. 3.) Finding cheapest prices for each region:
  • 22. 22  South west have cheapest locations Now we create data frame for each region.
  • 23. 23 Scenario: 1 We want to pick location which have cheap prices and best optimal Energy. Which can generate 2 million kwh/a energy and our budget is 2 million. total energy = ( energy * m^2) + ( energy * m^2) + ( energy * m^2) + ( energy * m^2) = (238.62 * 3000) + ( 236.11 * 2000) + (242.04 *3000) + (239.78 *2000) = 2393760 kwh/a > 2 million. Total cost = (price + cost) * m^2 for each region = 1645000 < 2 millions So, if we use all the area available to us and build their plant we can get enegy more than 2 million kwh/a and budget will be 1.6 million. But we can optimize it further by using only half of the land where price are heigest.
  • 24. 24 We can generate more than 2 million kwh/a in minimum budget of 1384000 Euro. Locations for Scenario 1: One can using this method also to predict scenario 2 where one must take location in region where highest energy is generated. But for scenario 2 we will try to use method of Hierarchical clustering .
  • 25. 25 Scenario 2 Hierarchical clustering: In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:  Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.  Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
  • 27. 27 After implementing same method for all region datasets, we get this result. Observations: 1 From above result we want to find points that generates Highest energy and also cheap price. 2 First, we will choose Optimal point for all the regions, we can see that second highest energy point is cheap compare to highest point. But from the calculation we can see it is not fulfilling the demand of 3 million kwh/a. Now we will choose the point that have highest value of energy.
  • 28. 28 Although we are using highest energy point we are not fulfilling energy demand. Scenario 2 will be not feasible. Conclusion
  • 29. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 29/62 Code : Project to find best location for Solar Farms that can fullfill our Energy Requiement . In [476… # import Libraries import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline In [477… #import Datasets df = pd.read_excel(r'C:UsersSHRUTEJDesktopAI ProjectInstalled Solar Plants.xls df_1 = pd.read_excel(r'C:UsersSHRUTEJDesktopAI ProjectEnvironment Solar Data.x In [478… df.head() Out[478]: Model ID Sunshine Hours per year Size Solar Panel m2 Generated energy kWh/a 0 1 1418 794 233616 1 2 1474 1726 525410 2 3 1335 5776 1612292 3 4 1224 6494 1681651 4 5 1320 2085 576313 In [479… df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 19 entries, 0 to 18 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Model ID 19 non-null int64 1 Sunshine Hours per year 19 non-null int64 2 Size Solar Panel m2 19 non-null int64 3 Generated energy kWh/a 19 non-null int64 dtypes: int64(4) memory usage: 736.0 bytes Obsevation : See that we want to find Energy per m^2 ..so we can find it if we devide Genrated energy / Solar panel m2 ""
  • 30. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 30/62 In [480… df['energy_per_m2'] = df['Generated energy kWh/a']/df['Size Solar Panel m2'] In [481… df.head() Out[481]: Model Sunshine Hours per Size Solar Panel Generated energy energy_per_m2 ID year m2 kWh/a 0 1 1418 794 233616 294.226700 1 2 1474 1726 525410 304.409038 2 3 1335 5776 1612292 279.136427 3 4 1224 6494 1681651 258.954573 4 5 1320 2085 576313 276.409113 EDA of installed plant dataset. In [482… sns.pairplot(df) Out[482]: <seaborn.axisgrid.PairGrid at 0x21012d5f6d0>
  • 31. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 31/62 Observation : 1) Enregy per m^2 is directly propsnal to Sunshine Hours per year. In [483… # Let's take close look by jointplot. sns.jointplot(x='Sunshine Hours per year',y='energy_per_m2',data = df) Out[483]: <seaborn.axisgrid.JointGrid at 0x21012d5fa30> Observation : From the straight line we can think about implemanting linear regration model. In [484… #Let's see how much is it corelating.. #we find corelation and plot it with heatmap. sns.heatmap(df.corr(),annot = True) Out[484]: <AxesSubplot:>
  • 32. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 32/62 Model Implementaion: In [485… # creating Train and Test data: X = df[[ 'Sunshine Hours per year']] y = df['energy_per_m2'] In [486… # make a Split in the Datasets and importing Linear Regression model and fiting on from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_sta from sklearn.linear_model import LinearRegression lm = LinearRegression() lm.fit(X_train,y_train) Out[486]: LinearRegression() Result of Regression In [487… # intercept is value of C in y = mx + c print(lm.intercept_) 36.40591577945423 In [488… # coefficent is slop of the line :
  • 33. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 33/62 lm.coef_ Out[488]: array([0.18182098]) In [489… #Let's check predicted values predictions = lm.predict(X_test) predictions Out[489]: array([258.95479913, 250.04557096, 304.41004491, 279.13692826, 294.22806986, 248.22736112, 260.40936699, 255.13655848]) In [490… #We can check how it varries from actual values by plotting scatter plot. plt.scatter(y_test,predictions) Out[490]: <matplotlib.collections.PathCollection at 0x21015f39ee0> Observation: From graph we can see there is almost no deveation from y_test & prediction . In [491… from sklearn import metrics metrics.mean_absolute_error(y_test,predictions) Out[491]: 0.00046171013969242836 In [492… from sklearn.metrics import r2_score r2_score(y_test, predictions) Out[492]: 0.9999999989429613 From result we can see that there is almost zero error in our model and R2 score is also nearly 1. which is best possible outcome.
  • 34. localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 34/62 Now lets explore second database. In [493… df_1.head() Out[493]: Date Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices 0 2018-01-01 0.22 0.27 45.36 2.560 304.0 1 2018-01-01 0.28 0.23 34.02 3.216 318.0 2 2018-01-01 0.28 0.27 45.36 3.144 102.0 3 2018-01-01 0.35 0.20 30.78 3.664 248.0 4 2018-01-01 0.18 0.30 33.21 2.632 326.0 In [494… df_1.describe() Out[494]: Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices count 3304.000000 3304.000000 3302.000000 3303.000000 3303.000000 mean 0.448610 0.495627 99.031788 3.776776 146.799273 std 0.266589 0.255900 51.760654 17.347552 78.796555 min 0.014000 0.002000 20.160000 2.400000 57.000000 25% 0.210000 0.250000 48.640000 3.042000 92.000000 50% 0.350000 0.514000 99.560000 3.474000 126.000000 75% 0.720000 0.729000 140.800000 3.897000 159.000000 max 0.873000 0.929000 1000.000000 1000.000000 330.000000 In [495… df_1.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 3304 entries, 0 to 3303 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 3304 non-null datetime64[ns] 1 Longitude 3304 non-null float64 2 Latitude 3304 non-null float64 3 Sunhine Hours 3302 non-null float64 4 Avg. Wind Speed 3303 non-null float64 5 Property prices 3303 non-null float64 dtypes: datetime64[ns](1), float64(5) memory usage: 155.0 KB In [496… #EDA of dataset sns.pairplot(df_1) Out[496]: <seaborn.axisgrid.PairGrid at 0x21015dacac0>
  • 35. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 35/62 In [497… df_1.groupby(['Date']).count()
  • 36. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 36/62 Out[497]: Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices Date 2018-01-01 59 59 59 59 59 2018-02-01 59 59 59 59 59 2018-03-01 59 59 58 59 59 2018-04-01 59 59 59 59 59 2018-05-01 59 59 59 59 59 2018-06-01 59 59 59 59 59 2018-07-01 59 59 59 59 59 2018-08-01 59 59 59 59 59 2018-09-01 59 59 59 59 59 2018-10-01 59 59 58 59 59 2018-11-01 59 59 59 59 59 2018-12-01 59 59 59 59 59 2019-01-01 59 59 59 59 59 2019-02-01 59 59 59 59 59 2019-03-01 59 59 59 59 59 2019-04-01 59 59 59 59 59 2019-05-01 59 59 59 58 59 2019-06-01 59 59 59 59 59 2019-07-01 59 59 59 59 59 2019-08-01 59 59 59 59 59 2019-09-01 59 59 59 59 59 2019-10-01 59 59 59 59 59 2019-11-01 59 59 59 59 59 2019-12-01 59 59 59 59 59 2020-01-01 59 59 59 59 59 2020-02-01 59 59 59 59 59 2020-03-01 59 59 59 59 59 2020-04-01 59 59 59 59 59 2020-05-01 59 59 59 59 59 2020-06-01 59 59 59 59 59 2020-07-01 59 59 59 59 59 2020-08-01 59 59 59 59 59 2020-09-01 59 59 59 59 59 2020-10-01 59 59 59 59 59 2020-11-01 59 59 59 59 59
  • 37. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 37/62 Longitude Latitude Sunhine Hours Avg. Wind Speed Property prices Date 2020-12-01 59 59 59 59 59 2021-01-01 59 59 59 59 59 2021-02-01 59 59 59 59 59 2021-03-01 59 59 59 59 59 2021-04-01 59 59 59 59 59 2021-05-01 59 59 59 59 59 2021-06-01 59 59 59 59 59 2021-07-01 59 59 59 59 59 2021-08-01 59 59 59 59 59 2021-09-01 59 59 59 59 59 2021-10-01 59 59 59 59 58 2021-11-01 59 59 59 59 59 2021-12-01 59 59 59 59 59 2022-01-01 59 59 59 59 59 2022-02-01 59 59 59 59 59 2022-03-01 59 59 59 59 59 2022-04-01 59 59 59 59 59 2022-05-01 59 59 59 59 59 2022-06-01 59 59 59 59 59 2022-07-01 59 59 59 59 59 2022-08-01 59 59 59 59 59 we can from above two result that there are 59 uniqe properties's data is available to us . Also from pair plot we can see properties devided into 4 Clusters. In datasets we can see data for 56 months so we have to convert it in yearly format. Price and Wind Energy are reamaining same thorought time period for each prpoperty. We will create dataset for 59 properties. In [498… # finding uniqe properties. locations_ = df_1[['Longitude','Latitude']].drop_duplicates() locations_
  • 38. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 38/62 Out[498]: Longitude Latitude 0 0.220 0.270 1 0.280 0.230 2 0.280 0.270 3 0.350 0.200 4 0.180 0.300 5 0.310 0.320 6 0.300 0.250 7 0.200 0.200 8 0.230 0.250 9 0.210 0.210 10 0.220 0.700 11 0.280 0.680 12 0.280 0.690 13 0.350 0.700 14 0.180 0.800 15 0.310 0.750 16 0.300 0.720 17 0.200 0.770 18 0.230 0.760 19 0.210 0.740 20 0.720 0.700 21 0.640 0.680 22 0.630 0.690 23 0.680 0.700 24 0.770 0.800 25 0.770 0.750 26 0.760 0.720 27 0.740 0.770 28 0.720 0.760 29 0.700 0.740 30 0.720 0.220 31 0.640 0.260 32 0.630 0.280 33 0.680 0.250 34 0.770 0.180 35 0.770 0.310
  • 39. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 39/62 Longitude Latitude 36 0.760 0.300 37 0.740 0.200 38 0.720 0.230 39 0.700 0.210 40 0.233 0.929 41 0.617 0.514 42 0.373 0.002 43 0.864 0.838 44 0.081 0.805 45 0.124 0.413 46 0.164 0.106 47 0.137 0.710 48 0.064 0.835 49 0.160 0.173 50 0.014 0.729 51 0.025 0.472 52 0.715 0.211 53 0.808 0.505 54 0.873 0.379 55 0.856 0.100 56 0.233 0.623 57 0.567 0.727 58 0.180 0.611 In [499… #Creating Avrage Sunrise hours per month for each property. df_1.groupby(['Longitude','Latitude'])['Sunhine Hours'].mean().reset_index()
  • 40. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 40/62 Out[499]: Longitude Latitude Sunhine Hours 0 0.014 0.729 96.410714 1 0.025 0.472 94.302857 2 0.064 0.835 94.963929 3 0.081 0.805 94.248214 4 0.124 0.413 96.712857 5 0.137 0.710 98.401786 6 0.160 0.173 96.977455 7 0.164 0.106 93.440714 8 0.180 0.300 107.601830 9 0.180 0.611 94.173571 10 0.180 0.800 108.632009 11 0.200 0.200 107.670134 12 0.200 0.770 107.224554 13 0.210 0.210 105.145714 14 0.210 0.740 104.628616 15 0.220 0.270 107.102411 16 0.220 0.700 105.796607 17 0.230 0.250 106.368348 18 0.230 0.760 105.567589 19 0.233 0.623 93.926071 20 0.233 0.929 94.735714 21 0.280 0.230 107.218527 22 0.280 0.270 106.851696 23 0.280 0.680 108.749732 24 0.280 0.690 105.909911 25 0.300 0.250 106.069821 26 0.300 0.720 105.526205 27 0.310 0.320 105.487634 28 0.310 0.750 106.812723 29 0.350 0.200 106.869375 30 0.350 0.700 109.638409 31 0.373 0.002 95.600357 32 0.567 0.727 111.343214 33 0.617 0.514 92.911429 34 0.630 0.280 91.532143 35 0.630 0.690 96.505357
  • 41. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 41/62 Longitude Latitude Sunhine Hours 36 0.640 0.260 95.721429 37 0.640 0.680 93.820357 38 0.680 0.250 95.400000 39 0.680 0.700 92.683929 40 0.700 0.210 95.352857 41 0.700 0.740 94.743571 42 0.715 0.211 93.179286 43 0.720 0.220 90.614286 44 0.720 0.230 94.259643 45 0.720 0.700 93.306786 46 0.720 0.760 95.486429 47 0.740 0.200 92.203929 48 0.740 0.770 96.747500 49 0.760 0.300 93.957143 50 0.760 0.720 96.785357 51 0.770 0.180 92.680357 52 0.770 0.310 97.258929 53 0.770 0.750 95.524643 54 0.770 0.800 96.507143 55 0.808 0.505 92.385714 56 0.856 0.100 95.538929 57 0.864 0.838 93.215357 58 0.873 0.379 94.596429 In [500… df_1.groupby(['Longitude','Latitude'])['Avg. Wind Speed'].mean().reset_index()
  • 42. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 42/62 Out[500]: Longitude Latitude Avg. Wind Speed 0 0.014 0.729 3.680839 1 0.025 0.472 3.540536 2 0.064 0.835 3.565125 3 0.081 0.805 3.615589 4 0.124 0.413 3.676500 5 0.137 0.710 3.593732 6 0.160 0.173 3.511929 7 0.164 0.106 3.597750 8 0.180 0.300 3.360143 9 0.180 0.611 3.478821 10 0.180 0.800 3.196000 11 0.200 0.200 3.168286 12 0.200 0.770 3.327143 13 0.210 0.210 3.179000 14 0.210 0.740 3.243714 15 0.220 0.270 3.214571 16 0.220 0.700 3.245143 17 0.230 0.250 3.097286 18 0.230 0.760 3.140571 19 0.233 0.623 3.607393 20 0.233 0.929 3.706714 21 0.280 0.230 3.232714 22 0.280 0.270 3.204429 23 0.280 0.680 3.212714 24 0.280 0.690 3.137429 25 0.300 0.250 3.321429 26 0.300 0.720 20.975714 27 0.310 0.320 3.174714 28 0.310 0.750 3.206571 29 0.350 0.200 3.235286 30 0.350 0.700 3.258857 31 0.373 0.002 3.812143 32 0.567 0.727 3.604982 33 0.617 0.514 3.448929 34 0.630 0.280 3.697875 35 0.630 0.690 3.590196
  • 43. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 43/62 Longitude Latitude Avg. Wind Speed 36 0.640 0.260 3.705429 37 0.640 0.680 3.495375 38 0.680 0.250 3.606429 39 0.680 0.700 3.578625 40 0.700 0.210 3.546321 41 0.700 0.740 3.635679 42 0.715 0.211 3.567857 43 0.720 0.220 3.610607 44 0.720 0.230 3.617357 45 0.720 0.700 3.485571 46 0.720 0.760 3.708321 47 0.740 0.200 3.597911 48 0.740 0.770 3.554196 49 0.760 0.300 3.630857 50 0.760 0.720 3.608679 51 0.770 0.180 3.552107 52 0.770 0.310 3.673768 53 0.770 0.750 3.644357 54 0.770 0.800 3.588218 55 0.808 0.505 3.583125 56 0.856 0.100 3.668304 57 0.864 0.838 3.624429 58 0.873 0.379 3.682125 In [501… df_1.groupby(['Longitude','Latitude'])['Property prices'].mean().reset_index()
  • 44. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 44/62 Out[501]: Longitude Latitude Property prices 0 0.014 0.729 67.0 1 0.025 0.472 116.0 2 0.064 0.835 91.0 3 0.081 0.805 65.0 4 0.124 0.413 126.0 5 0.137 0.710 86.0 6 0.160 0.173 130.0 7 0.164 0.106 95.0 8 0.180 0.300 326.0 9 0.180 0.611 127.0 10 0.180 0.800 276.0 11 0.200 0.200 105.0 12 0.200 0.770 224.0 13 0.210 0.210 273.0 14 0.210 0.740 312.0 15 0.220 0.270 304.0 16 0.220 0.700 174.0 17 0.230 0.250 159.0 18 0.230 0.760 137.0 19 0.233 0.623 73.0 20 0.233 0.929 128.0 21 0.280 0.230 318.0 22 0.280 0.270 102.0 23 0.280 0.680 131.0 24 0.280 0.690 330.0 25 0.300 0.250 129.0 26 0.300 0.720 245.0 27 0.310 0.320 232.0 28 0.310 0.750 320.0 29 0.350 0.200 248.0 30 0.350 0.700 277.0 31 0.373 0.002 139.0 32 0.567 0.727 149.0 33 0.617 0.514 137.0 34 0.630 0.280 61.0 35 0.630 0.690 150.0
Longitude Latitude Property prices
36 0.640 0.260 117.0
37 0.640 0.680 74.0
38 0.680 0.250 134.0
39 0.680 0.700 93.0
40 0.700 0.210 93.0
41 0.700 0.740 149.0
42 0.715 0.211 107.0
43 0.720 0.220 82.0
44 0.720 0.230 98.0
45 0.720 0.700 73.0
46 0.720 0.760 90.0
47 0.740 0.200 72.0
48 0.740 0.770 87.0
49 0.760 0.300 96.0
50 0.760 0.720 137.0
51 0.770 0.180 57.0
52 0.770 0.310 139.0
53 0.770 0.750 108.0
54 0.770 0.800 92.0
55 0.808 0.505 99.0
56 0.856 0.100 112.0
57 0.864 0.838 74.0
58 0.873 0.379 115.0
Now we merge all of the columns together:
In [502]: locations_ = locations_.sort_values(by=['Longitude', 'Latitude'])
In [503]: locations_['Avg Sunshine hours'] = df_1.groupby(['Longitude','Latitude'])['Sunhine Hours'].mean().values
In [504]: locations_['Avg Wind speed'] = df_1.groupby(['Longitude','Latitude'])['Avg. Wind Speed'].mean().values
In [505]: locations_['Avg Price'] = df_1.groupby(['Longitude','Latitude'])['Property prices'].mean().values
In [506]: locations_.head()
Out[506]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price
50 0.014 0.729 96.785357 3.608679 137.0
51 0.025 0.472 92.680357 3.552107 57.0
48 0.064 0.835 96.747500 3.554196 87.0
44 0.081 0.805 94.259643 3.617357 98.0
45 0.124 0.413 93.306786 3.485571 73.0
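As a cross-check on the three assignments above, the per-property averages could also be attached with an explicit merge key instead of relying on row order. This is only a minimal sketch, assuming df_1 and locations_ as loaded earlier; locations_merged is a hypothetical name.

# average the monthly readings per property and merge them back on the coordinates
avg_by_property = (df_1.groupby(['Longitude', 'Latitude'])
                        .agg({'Sunhine Hours': 'mean',
                              'Avg. Wind Speed': 'mean',
                              'Property prices': 'mean'})
                        .reset_index())
# merging on the coordinates guarantees that every average lands on the right property
locations_merged = locations_.merge(avg_by_property, on=['Longitude', 'Latitude'], how='left')
locations_merged.head()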
We now have a dataset of 59 properties, but we do not yet know which region/area each of them belongs to.
In [507]:
# we plot longitude vs latitude so we can see the properties
Reg = np.array(locations_[['Longitude','Latitude']])
plt.scatter(Reg[:,0],Reg[:,1])
Out[507]: <matplotlib.collections.PathCollection at 0x2101ea7df10>
We can see that there are four regions. Since no labels are available, we have to use unsupervised learning to predict the clusters. We will use the K-means algorithm for clustering.
In [508]:
from sklearn.cluster import KMeans
Kmeans = KMeans(n_clusters = 4)
Kmeans.fit(Reg)
Out[508]: KMeans(n_clusters=4)
In [509]:
Reg_list = Kmeans.labels_
Reg_list
Out[509]: array([3, 3, 3, 3, 0, 3, 0, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 3, 3, 3, 0, 0, 3, 3, 0, 3, 0, 3, 0, 3, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2])
In [510]:
# from the cluster centres we can define which region is which
Kmeans.cluster_centers_
Out[510]: array([[0.2415, 0.22814286], [0.71328571, 0.70671429], [0.73646154, 0.24076923], [0.19594444, 0.72355556]])
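The choice of four clusters comes from eyeballing the scatter plot; a quick way to double-check it is the elbow method, plotting the K-means inertia for several values of k. The following is only a minimal sketch, assuming Reg is the (59, 2) coordinate array built above.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 9)
# inertia_ is the within-cluster sum of squared distances for each fitted model
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(Reg).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()

A clear kink at k = 4 would support the choice made above.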
In [511]:
fig,(ax1) = plt.subplots(1, sharey=True, figsize = (5,5))
ax1.set_title('Separate Regions using Kmeans')
ax1.scatter(Reg[:,0],Reg[:,1],c= Kmeans.labels_,cmap='rainbow')
Out[511]: <matplotlib.collections.PathCollection at 0x2101eacacd0>
In [512]: locations_['region'] = Reg_list
In [513]: locations_.head()
Out[513]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region
50 0.014 0.729 96.785357 3.608679 137.0 3
51 0.025 0.472 92.680357 3.552107 57.0 3
48 0.064 0.835 96.747500 3.554196 87.0 3
44 0.081 0.805 94.259643 3.617357 98.0 3
45 0.124 0.413 93.306786 3.485571 73.0 0
In [514]:
# defining the regions using one-hot encoding
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(locations_[['region']]).toarray())
locations_1 = locations_.join(encoder_df)
locations_1.columns = ['Longitude','Latitude','Avg Sunhine Hours','Avg. Wind Speed','Avg prices','region','NE','NW','SW','SE']
locations_1.head()
Out[514]: Longitude Latitude Avg Sunhine Hours Avg. Wind Speed Avg prices region NE NW SW SE
50 0.014 0.729 96.785357 3.608679 137.0 3 0.0 1.0 0.0 0.0
51 0.025 0.472 92.680357 3.552107 57.0 3 0.0 0.0 1.0 0.0
48 0.064 0.835 96.747500 3.554196 87.0 3 0.0 1.0 0.0 0.0
44 0.081 0.805 94.259643 3.617357 98.0 3 0.0 0.0 1.0 0.0
45 0.124 0.413 93.306786 3.485571 73.0 0 0.0 1.0 0.0 0.0
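The same one-hot region columns could also be produced with pandas directly, which avoids building a separate encoder object. A minimal sketch, assuming locations_ already carries the integer region labels from K-means; locations_ohe is a hypothetical name.

import pandas as pd

region_dummies = pd.get_dummies(locations_['region'], prefix='region')
# concat aligns on the index, so every dummy row stays with its own property
locations_ohe = pd.concat([locations_, region_dummies], axis=1)
locations_ohe.head()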
In [515]:
# defining the Area name for each region with a lambda function (mapping taken from the cluster labels above)
locations_['Area'] = locations_['region'].apply(lambda region: "South-West" if region == 3 else ("South-East" if region == 0 else ("North-West" if region == 1 else "North-East")))
In [516]: locations_.head()
Out[516]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Area
50 0.014 0.729 96.785357 3.608679 137.0 3 South-West
51 0.025 0.472 92.680357 3.552107 57.0 3 South-West
48 0.064 0.835 96.747500 3.554196 87.0 3 South-West
44 0.081 0.805 94.259643 3.617357 98.0 3 South-West
45 0.124 0.413 93.306786 3.485571 73.0 0 South-East
The sunshine hours are monthly averages, and we want them on a yearly basis, so we multiply them by 12.
In [517]: locations_['Avg Sunshine hours'] = locations_['Avg Sunshine hours']*12
Now we predict the energy per square metre using the coefficient and intercept of the linear regression fitted earlier.
In [518]:
locations_['Pridected Energy'] = lm.coef_[0]*locations_['Avg Sunshine hours'] + lm.intercept_
locations_.head()
Out[518]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Area Pridected Energy
50 0.014 0.729 1161.424286 3.608679 137.0 3 South-West 247.577221
51 0.025 0.472 1112.164286 3.552107 57.0 3 South-West 238.620720
48 0.064 0.835 1160.970000 3.554196 87.0 3 South-West 247.494623
44 0.081 0.805 1131.115714 3.617357 98.0 3 South-West 242.066487
45 0.124 0.413 1119.681429 3.485571 73.0 0 South-East 239.987494
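Instead of copying the coefficient and intercept by hand, the fitted model can also predict directly. A minimal sketch, assuming lm is the single-feature linear regression fitted earlier on sunshine hours (its training step is not shown in this part of the notebook).

# one sunshine-hours column in, one predicted energy value per property out
X_yearly = locations_[['Avg Sunshine hours']].to_numpy()
locations_['Pridected Energy'] = lm.predict(X_yearly)
locations_.head()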
Now we will group the data by region and do the analysis.
In [519]: sns.countplot(x='Area',data = locations_)
Out[519]: <AxesSubplot:xlabel='Area', ylabel='count'>
In [520]: locations_.groupby(['Area']).count()
Out[520]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Pridected Energy
Area
North-East 13 13 13 13 13 13 13
North-West 14 14 14 14 14 14 14
South-East 14 14 14 14 14 14 14
South-West 18 18 18 18 18 18 18
In [521]: sns.barplot(x = 'Area', y = 'Pridected Energy',data = locations_, estimator = max)
Out[521]: <AxesSubplot:xlabel='Area', ylabel='Pridected Energy'>
In [522]: locations_.groupby(['Area'], sort=False)['Pridected Energy'].max()
Out[522]:
Area
South-West 273.424860
South-East 271.177163
North-West 273.681714
North-East 279.340308
Name: Pridected Energy, dtype: float64
In [523]: sns.barplot(x = 'Area', y = 'Avg Price',data = locations_, estimator = max)
Out[523]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
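The per-region figures read off these bar plots can also be collected in a single summary table with one groupby call. A minimal sketch, assuming locations_ as built above.

# maximum predicted energy plus cheapest and most expensive average price per region
summary = locations_.groupby('Area').agg(
    max_energy=('Pridected Energy', 'max'),
    min_price=('Avg Price', 'min'),
    max_price=('Avg Price', 'max'),
)
summary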
In [524]: sns.barplot(x = 'Area', y = 'Avg Price',data = locations_, estimator = min)
Out[524]: <AxesSubplot:xlabel='Area', ylabel='Avg Price'>
In [525]: locations_.groupby(['Area'], sort=False)['Avg Price'].min()
Out[525]:
Area
  • 52. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 52/62 South-West 57.0 South-East 65.0 North-West 74.0 North-East 61.0 Name: Avg Price, dtype: float64 In [526… NW_ = locations_[locations_['Area'] == 'North-West'] NE_ = locations_[locations_['Area'] == 'North-East'] SE_ = locations_[locations_['Area'] == 'South-East'] SW_ = locations_[locations_['Area'] == 'South-West'] For scenario 1 we will see how we can optimize cost and energy In [527… NW_ Out[527]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy 57 0.567 0.727 1118.584286 3.624429 74.0 1 North- West 239.788010 41 0.617 0.514 1136.922857 3.635679 149.0 1 North- West 243.122347 22 0.630 0.690 1282.220357 3.204429 102.0 1 North- West 269.540482 21 0.640 0.680 1286.622321 3.232714 318.0 1 North- West 270.340851 23 0.680 0.700 1304.996786 3.212714 131.0 1 North- West 273.681714 29 0.700 0.740 1282.432500 3.235286 248.0 1 North- West 269.579054 20 0.720 0.700 1136.828571 3.706714 128.0 1 North- West 243.105204 28 0.720 0.760 1281.752679 3.206571 320.0 1 North- West 269.455448 27 0.740 0.770 1265.851607 3.174714 232.0 1 North- West 266.564299 26 0.760 0.720 1266.314464 20.975714 245.0 1 North- West 266.648457 25 0.770 0.750 1272.837857 3.321429 129.0 1 North- West 267.834546 24 0.770 0.800 1270.918929 3.137429 330.0 1 North- West 267.485645 53 0.808 0.505 1146.295714 3.644357 108.0 1 North- West 244.826530 43 0.864 0.838 1087.371429 3.610607 82.0 1 North- West 234.112858 In [528… # we wiil sort value of colums price and energy and choose the row where price is NW_sorted = NW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True,
  • 53. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 53/62 best_row_nw = NW_sorted.head(1) best_row_nw Avg Sunshine Avg Wind Avg Pridected Out[528]: Longitude Latitude region Area hours speed Price Energy North- 57 0.567 0.727 1118.584286 3.624429 74.0 1 239.78801 West In [529… SE_ Out[529]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy 45 0.124 0.413 1119.681429 3.485571 73.0 0 South- East 239.987494 49 0.160 0.173 1127.485714 3.630857 96.0 0 South- East 241.406477 46 0.164 0.106 1145.837143 3.708321 90.0 0 South- East 244.743152 4 0.180 0.300 1160.554286 3.676500 126.0 0 South- East 247.419037 7 0.200 0.200 1121.288571 3.597750 95.0 0 South- East 240.279706 9 0.210 0.210 1130.082857 3.478821 127.0 0 South- East 241.878692 0 0.220 0.270 1156.928571 3.680839 67.0 0 South- East 246.759806 8 0.230 0.250 1291.221964 3.360143 326.0 0 South- East 271.177163 1 0.280 0.230 1131.634286 3.540536 116.0 0 South- East 242.160774 2 0.280 0.270 1139.567143 3.565125 91.0 0 South- East 243.603134 6 0.300 0.250 1163.729455 3.511929 130.0 0 South- East 247.996349 5 0.310 0.320 1180.821429 3.593732 86.0 0 South- East 251.104029 3 0.350 0.200 1130.978571 3.615589 65.0 0 South- East 242.041552 42 0.373 0.002 1118.151429 3.567857 107.0 0 South- East 239.709308 In [530… SE_sorted = SE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, best_row_se = SE_sorted.head(1) best_row_se Out[530]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy South-
  • 54. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 54/62 3 0.35 0.2 1130.978571 3.615589 65.0 0 242.041552 East In [531… SW_ Out[531]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy 50 0.014 0.729 1161.424286 3.608679 137.0 3 South- West 247.577221 51 0.025 0.472 1112.164286 3.552107 57.0 3 South- West 238.620720 48 0.064 0.835 1160.970000 3.554196 87.0 3 South- West 247.494623 44 0.081 0.805 1131.115714 3.617357 98.0 3 South- West 242.066487 47 0.137 0.710 1106.447143 3.597911 72.0 3 South- West 237.581223 58 0.180 0.611 1135.157143 3.682125 115.0 3 South- West 242.801303 14 0.180 0.800 1255.543393 3.243714 312.0 3 South- West 264.690050 17 0.200 0.770 1276.420179 3.097286 159.0 3 South- West 268.485888 19 0.210 0.740 1127.112857 3.607393 73.0 3 South- West 241.338684 10 0.220 0.700 1303.584107 3.196000 276.0 3 South- West 273.424860 18 0.230 0.760 1266.811071 3.140571 137.0 3 South- West 266.738750 56 0.233 0.623 1146.467143 3.668304 112.0 3 South- West 244.857699 40 0.233 0.929 1144.234286 3.546321 93.0 3 South- West 244.451719 11 0.280 0.680 1292.041607 3.168286 105.0 3 South- West 271.326191 12 0.280 0.690 1286.694643 3.327143 224.0 3 South- West 270.354001 16 0.300 0.720 1269.559286 3.245143 174.0 3 South- West 267.238433 15 0.310 0.750 1285.228929 3.214571 304.0 3 South- West 270.087503 13 0.350 0.700 1261.748571 3.179000 273.0 3 South- West 265.818281 In [532… SW_sorted = SW_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, best_row_sw = SW_sorted.head(1) best_row_sw Out[532]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy
  • 55. 1/9/23, 1:39 PM Final_1 localhost:8888/nbconvert/html/Desktop/AI Project/Final_1.ipynb?download=false 55/62 South- 51 0.025 0.472 1112.164286 3.552107 57.0 3 238.62072 West In [533… NE_ Out[533]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy 32 0.630 0.280 1336.118571 3.604982 149.0 2 North- East 279.340308 31 0.640 0.260 1147.204286 3.812143 139.0 2 North- East 244.991727 33 0.680 0.250 1114.937143 3.448929 137.0 2 North- East 239.124883 39 0.700 0.210 1112.207143 3.578625 93.0 2 North- East 238.628512 52 0.715 0.211 1167.107143 3.673768 139.0 2 North- East 248.610484 30 0.720 0.220 1315.660909 3.258857 277.0 2 North- East 275.620676 38 0.720 0.230 1144.800000 3.606429 134.0 2 North- East 244.554577 37 0.740 0.200 1125.844286 3.495375 74.0 2 North- East 241.108031 36 0.760 0.300 1148.657143 3.705429 117.0 2 North- East 245.255887 34 0.770 0.180 1098.385714 3.697875 61.0 2 North- East 236.115486 35 0.770 0.310 1158.064286 3.590196 150.0 2 North- East 246.966303 55 0.856 0.100 1108.628571 3.583125 99.0 2 North- East 237.977853 54 0.873 0.379 1158.085714 3.588218 92.0 2 North- East 246.970199 In [534… NE_sorted = NE_.sort_values(by=["Avg Price", "Pridected Energy"], ascending=[True, best_row_ne = NE_sorted.head(1) best_row_ne Out[534]: Longitude Latitude Avg Sunshine Avg Wind Avg region Area Pridected hours speed Price Energy North- 34 0.77 0.18 1098.385714 3.697875 61.0 2 236.115486 East In [535… Scenari0_1 = pd.concat([best_row_nw, best_row_se, best_row_sw, best_row_ne], axis=0 In [536… Scenari0_1
Out[536]: Longitude Latitude Avg Sunshine hours Avg Wind speed Avg Price region Area Pridected Energy
57 0.567 0.727 1118.584286 3.624429 74.0 1 North-West 239.788010
3 0.350 0.200 1130.978571 3.615589 65.0 0 South-East 242.041552
51 0.025 0.472 1112.164286 3.552107 57.0 3 South-West 238.620720
34 0.770 0.180 1098.385714 3.697875 61.0 2 North-East 236.115486
In [537]:
# Calculate the energy: 3000 m^2 in South-West, 2000 m^2 in North-East, 2000 m^2 in South-East and 1500 m^2 in North-West
Total_Energy = (238.620720 * 3000) + (236.115486 * 2000) + (242.041552 * 2000) + (239.788010 * 1500)
In [538]:
Total_Energy = round(Total_Energy)
Total_Energy
Out[538]: 2031858
In [539]:
# Cost: (land price + 100 EUR material) per m^2, multiplied by the area bought in each region
Total_cost = round((3000 * 157) + (2000 * 161) + (2000 * 165) + (1500 * 174))
Total_cost
Out[539]: 1384000
In [540]:
print(" Total Energy generated in kWh/a:",Total_Energy)
print(" Total cost in Euro:",Total_cost)
print(" Total area we are occupying: 8500 m^2")
Total Energy generated in kWh/a: 2031858
Total cost in Euro: 1384000
Total area we are occupying: 8500 m^2
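The hand-typed numbers above can also be computed from the selected rows themselves, which makes it easy to try other area splits. A minimal sketch, assuming Scenari0_1 holds the four selected properties and using the same area allocation as the manual calculation; areas_m2 is a hypothetical name.

# area bought in each region, mirroring the manual calculation above
areas_m2 = {'South-West': 3000, 'North-East': 2000, 'South-East': 2000, 'North-West': 1500}

total_energy = 0.0
total_cost = 0.0
for _, row in Scenari0_1.iterrows():
    area = areas_m2[row['Area']]
    total_energy += row['Pridected Energy'] * area   # predicted kWh/a per m^2 times area
    total_cost += (row['Avg Price'] + 100) * area    # land price plus 100 EUR/m^2 material

print(round(total_energy), "kWh/a for", round(total_cost), "EUR")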
Scenario 2
We could also use the same method as above to solve scenario 2, but here we will try it using hierarchical clustering:
In [541]:
from sklearn.cluster import AgglomerativeClustering
# Extract the two columns of features that we want to use for clustering
NW_copmare = NW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_NW.fit(NW_copmare)
# Predict the clusters for each data point
pred = cluster_NW.fit_predict(NW_copmare)
# Create a scatter plot of the clusters
plt.scatter(NW_copmare['Avg Price'], NW_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [542]:
pred
Out[542]: array([3, 0, 1, 2, 1, 4, 0, 2, 4, 4, 1, 2, 0, 3], dtype=int64)
In [543]:
SE_copmare = SE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_SE.fit(SE_copmare)
# Predict the clusters for each data point
pred = cluster_SE.fit_predict(SE_copmare)
# Create a scatter plot of the clusters
plt.scatter(SE_copmare['Avg Price'], SE_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [544]:
SW_copmare = SW_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_SW = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_SW.fit(SW_copmare)
# Predict the clusters for each data point
pred = cluster_SW.fit_predict(SW_copmare)
# Create a scatter plot of the clusters
plt.scatter(SW_copmare['Avg Price'], SW_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
In [545]:
NE_copmare = NE_[['Avg Price','Pridected Energy']]
# Create an instance of the AgglomerativeClustering class
cluster_NE = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
# Fit the model to the data
cluster_NE.fit(NE_copmare)
# Predict the clusters for each data point
pred = cluster_NE.fit_predict(NE_copmare)
# Create a scatter plot of the clusters
plt.scatter(NE_copmare['Avg Price'], NE_copmare['Pridected Energy'], c=pred, cmap='rainbow')
plt.show()
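The choice of n_clusters=5 in the agglomerative models above is somewhat arbitrary; a dendrogram of the same Ward linkage makes the cluster structure visible and can guide that choice. A minimal sketch, assuming NW_copmare as built above (the other regions work the same way).

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# build the full Ward linkage tree on the price/energy features
linkage_matrix = hierarchy.linkage(NW_copmare, method='ward')
hierarchy.dendrogram(linkage_matrix, labels=NW_copmare.index.to_list())
plt.xlabel('Property (dataframe index)')
plt.ylabel('Ward distance')
plt.show()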
We will use the optimal points from the clusters, i.e. those which combine high energy with a cheap price.
In [546]:
# SW + NW + NE + SE
Energy_op = (252*2000) + (272 * 3000) + (273 * 3000) + (280 *2000)
Energy_op
Out[546]: 2699000
Now we will use the points which have the highest energy.
In [547]:
Energy_hi = (271*2000) + (273.5 * 3000) + (273 * 3000) + (280 *2000)
Energy_hi
Out[547]: 2741500.0
In [548]:
cost = (376*3000) + (249*2000) + (231 * 3000) + (426*2000)
cost
Out[548]: 3171000
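The highest-energy pick per region can also be automated instead of reading approximate values off the scatter plots. A minimal sketch, assuming locations_ as built above and the area limits from the problem statement; because it uses the exact per-region maxima, its totals can differ slightly from the rounded hand calculation above.

area_limits = {'North-West': 3000, 'North-East': 3000, 'South-West': 2000, 'South-East': 2000}

# row with the highest predicted energy in each region
best_energy = locations_.loc[locations_.groupby('Area')['Pridected Energy'].idxmax()]

energy_total = sum(row['Pridected Energy'] * area_limits[row['Area']]
                   for _, row in best_energy.iterrows())
cost_total = sum((row['Avg Price'] + 100) * area_limits[row['Area']]
                 for _, row in best_energy.iterrows())

print(round(energy_total), "kWh/a for", round(cost_total), "EUR")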
Conclusion
Scenario 1
We can fulfil the demand of 2 million kWh/a easily, and we do not even need the full budget of 2 million €: the selected properties cost about 1.38 million €. We are also not using all of the proposed building area. In the North-West region, where even the cheapest property is the most expensive of the four regional minima, we only use half of the allowed land (1,500 m² of 3,000 m²), which saves money.
Scenario 2
First we used the points that have slightly lower energy but are cheaper than the highest-energy points; this does not fulfil the energy demand (about 2.70 million kWh/a). Then we took the points with the highest energy without considering cost; this still does not fulfil the demand (about 2.74 million kWh/a) while already costing about 3.17 million €. To fulfil scenario 2 we would need more land and a somewhat larger budget.
Bibliography
Burkov, A. (2020). The Hundred-Page Machine Learning Book.