2. Introduction
PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark's capabilities. With PySpark, you can run applications in parallel on a distributed cluster (multiple nodes).
In real-world use, PySpark is used heavily in the machine learning and data science communities. Spark can run operations over billions or trillions of records on distributed clusters, up to 100 times faster than traditional single-machine Python applications.
5. Advantages of PySpark
• PySpark is a general-purpose, in-memory, distributed processing engine that lets you process data efficiently in a distributed fashion.
• Applications running on PySpark can be up to 100x faster than traditional single-machine systems.
• PySpark includes a machine learning library (MLlib) and supports graph processing.
• PySpark works well for building data ingestion pipelines.
• PySpark can also process real-time data using Spark Streaming and Kafka (see the sketch after this list).
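As a taste of the streaming use case, here is a minimal Structured Streaming sketch that reads from Kafka. The broker address (localhost:9092) and topic name (events) are placeholders, and it assumes the spark-sql-kafka connector package is available on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamDemo").getOrCreate()

# Subscribe to a Kafka topic as a streaming source
# (broker and topic names here are illustrative assumptions)
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka values arrive as bytes; cast them to strings for processing
messages = stream.selectExpr("CAST(value AS STRING) AS value")

# Write results to the console sink; awaitTermination() keeps the query alive
query = messages.writeStream.format("console").start()
query.awaitTermination()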
6. PySpark Architecture
Apache Spark uses a master-slave architecture in which the master is called the “Driver” and the slaves are called “Workers”. When you run a Spark application, the Spark Driver creates a context that serves as the entry point to your application; all operations (transformations and actions) are executed on the worker nodes, and resources are managed by the Cluster Manager.
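To make the picture concrete, here is a minimal sketch (the app name and master setting are examples, not taken from the environment used later) showing how the driver-side entry point and the cluster manager fit together:

from pyspark.sql import SparkSession

# The driver process creates the SparkSession, the entry point to the app.
# The master setting selects the cluster manager: "yarn" for a YARN cluster,
# or "local[*]" to run driver and executors on a single machine.
spark = (SparkSession.builder
         .appName("ArchitectureDemo")
         .master("local[*]")
         .getOrCreate())

# Transformations build a lazy plan; actions trigger execution on workers.
df = spark.range(10)   # transformation: nothing runs yet
print(df.count())      # action: tasks are scheduled on worker nodes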
7. Let’s begin
Accessing the PySpark environment on the PH Server (using VPN):
1. Open cmd/terminal
2. SSH to 192.168.29.51 (root/rahasia2021)
3. Create your own screen session
4. Type this command and press Enter: su hdfs -c 'pyspark --master yarn'
5. If successful, you will see the PySpark welcome banner and prompt.
8. RDD (Resilient Distributed Datasets)
An RDD is the fundamental data structure of PySpark: an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
More details:
https://sparkbyexamples.com/pyspark-rdd/
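As a rough sketch of how partitioned, lazy computation works (the session setup, partition count, and numbers are illustrative, and this complements the practice example below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Each partition of this RDD can be processed on a different node
rdd = sc.parallelize(range(10), numSlices=4)   # ask for 4 partitions

# Transformations (map, filter) are lazy and run per partition on workers
evenSquares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# reduce() is an action: partial sums from each partition are combined
print(evenSquares.reduce(lambda a, b: a + b))  # prints 120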
9. Practice
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to a PySpark application
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
sparkContext = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD
rdd = sparkContext.parallelize([1, 2, 3, 4, 5])

# collect() is an action: it returns all elements to the driver program
rddCollect = rdd.collect()
print("Number of Partitions: " + str(rdd.getNumPartitions()))
print("Action: First element: " + str(rdd.first()))
print(rddCollect)
More details:
https://sparkbyexamples.com/pyspark/pyspark-parallelize-create-rdd/
10. PySpark DataFrame
You can create a PySpark DataFrame manually using the toDF() and createDataFrame() methods; both functions take different signatures to create a DataFrame from an existing RDD, a list, or another DataFrame.
You can also create a PySpark DataFrame from data sources such as TXT, CSV, JSON, ORC, Avro, Parquet, and XML formats by reading from HDFS, S3, DBFS, Azure Blob storage, etc.
Finally, a PySpark DataFrame can also be created by reading data from RDBMS and NoSQL databases.
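As a minimal sketch of both approaches (the sample data and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# createDataFrame(): build a DataFrame from a Python list with column names
data = [("James", 30), ("Anna", 25)]
df = spark.createDataFrame(data, ["name", "age"])

# toDF(): convert an existing RDD of tuples into a DataFrame
rdd = spark.sparkContext.parallelize(data)
df2 = rdd.toDF(["name", "age"])

# Reading from a file-based source works similarly (path is hypothetical):
# df3 = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

df.show()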
14. Practice (Save DataFrame)
Save as CSV:
df.repartition(1).write.format('com.databricks.spark.csv') \
    .save('/home/oz/exercise/data_csv', header='true', mode='overwrite',
          nullValue=None, sep=',')
Save as JSON (note: in PySpark, mode() takes a string, not Scala's SaveMode):
df.write.mode('overwrite').json('/home/oz/exercise/data_json')
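To verify the saved output, you can read the files back; a quick sketch using the same paths as above:

# Read the saved data back into DataFrames
df_csv = spark.read.csv('/home/oz/exercise/data_csv', header=True)
df_json = spark.read.json('/home/oz/exercise/data_json')
df_csv.show()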
15. Command Example
su hdfs -c "/opt/anaconda3/bin/spark-submit --master yarn --num-executors 2 --conf spark.driver.memory=16g /home/oz/exercise/csv_check.py"
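The contents of csv_check.py are not included in this material; below is a purely hypothetical sketch of what such a script might contain (the input path and the checks performed are assumptions):

# csv_check.py -- hypothetical sketch of the submitted script
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_check").getOrCreate()

# Read the CSV written earlier and run a couple of basic checks
df = spark.read.csv('/home/oz/exercise/data_csv', header=True)
df.printSchema()
print("Row count:", df.count())

spark.stop()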
~~ Thank you 🙏🏻 ~~