Skip to content

smk's notes

Language Specifics : Python (Pyspark) and R (SparkR and sparklyr)

mightyjoe781/notes

1
0

Language Specifics : Python (Pyspark) and R (SparkR and sparklyr)🔗

NOTE: Here I only write about Pyspark as It is scope of my requirements, probably try to Read Book for more on R.

Pyspark🔗

Fundamental PySpark Differences🔗

While using Structured APIs, Python runs as fast as Scala except when using UDFs due to serialization costs.
The fundamental idea is that Spark is going to have to work a lot harder converting information from something that Spark and the JVM can understand to Python and back again. This includes both functions as well as data and is a process known as serialization.

Pandas Integration🔗

PySpark works across programming models.
A common pattern is to perform very-large scale ETL work with Spark and then collect the result to driver and use Pandas to manipulate it further

import pandas as pd
df = pd.DataFrame({"first":range(200), "second":range(50,250)})
sparkDF = spark.createDataFrame(df)
newPDF = sparkDF.toPandas()
newPDF.head()

... NOTE Part on R is left out ...