Language Specifics : Python (Pyspark) and R (SparkR and sparklyr)🔗
NOTE: Here I only write about Pyspark as It is scope of my requirements, probably try to Read Book for more on R.
Pyspark🔗
Fundamental PySpark Differences🔗
- While using Structured APIs, Python runs as fast as Scala except when using UDFs due to serialization costs.
- The fundamental idea is that Spark is going to have to work a lot harder converting information from something that Spark and the JVM can understand to Python and back again. This includes both functions as well as data and is a process known as serialization.
Pandas Integration🔗
- PySpark works across programming models.
- A common pattern is to perform very-large scale ETL work with Spark and then collect the result to driver and use Pandas to manipulate it further
import pandas as pd
df = pd.DataFrame({"first":range(200), "second":range(50,250)})
sparkDF = spark.createDataFrame(df)
newPDF = sparkDF.toPandas()
newPDF.head()
... NOTE Part on R is left out ...