
Language Specifics: Python (PySpark) and R (SparkR and sparklyr)🔗

NOTE: I cover only PySpark here, as that is all my requirements call for; see the book for more on R.


PySpark🔗

Fundamental PySpark Differences🔗

  • When using the Structured APIs, Python runs about as fast as Scala, except when using UDFs, which incur serialization costs.
  • The fundamental issue is that Spark has to work much harder to convert data and functions from representations that Spark and the JVM understand into ones Python understands, and back again. This round trip, covering both functions and data, is known as serialization.

Pandas Integration🔗

  • PySpark works across programming models.
  • A common pattern is to perform very large-scale ETL work with Spark, collect the result to the driver, and then use pandas to manipulate it further:
```python
import pandas as pd

# Assumes an active SparkSession bound to the name `spark`
df = pd.DataFrame({"first": range(200), "second": range(50, 250)})
sparkDF = spark.createDataFrame(df)   # pandas -> Spark DataFrame
newPDF = sparkDF.toPandas()           # collect back to a pandas DataFrame
newPDF.head()
```

... NOTE Part on R is left out ...