Structured API Overview
- The Structured APIs allow manipulation of all sorts of data, from CSV files (semi-structured) to Parquet files (highly structured, because schema validation happens at write time); see the sketch after this list.
- They refer to three core types of distributed collection APIs:
- Datasets
- DataFrames
- SQL tables and views
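A minimal sketch (not from the original notes) of reading both kinds of sources through the same DataFrame API; the file paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-api-overview").getOrCreate()

# Semi-structured CSV: Spark must infer (or be handed) a schema at read time.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/people.csv"))        # hypothetical path

# Highly structured Parquet: the schema was validated and stored at write time.
parquet_df = spark.read.parquet("/data/people.parquet")  # hypothetical path

csv_df.printSchema()
parquet_df.printSchema()
```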
DataFrames and Datasets
Spark has two notions of structured collections: DataFrames and Datasets.
- Both are (distributed) table-like collections with rows and columns.
- Each column must have the same number of rows as all other columns.
- Each column has type information that must be consistent for every row in the collection.
- To Spark, both represent immutable, lazily evaluated plans that specify which operations to apply to data residing at some location in order to generate output (see the sketch below).
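A small illustration of the lazily evaluated plan idea, assuming a `SparkSession` named `spark` already exists; the transformations only describe work, and nothing runs until an action is called.

```python
df = spark.range(1000)                            # single-column DataFrame ("id")
evens = df.where("id % 2 = 0")                    # transformation: extends the plan, no execution
doubled = evens.selectExpr("id * 2 AS doubled")   # still nothing has run

print(doubled.count())                            # action: Spark executes the plan (prints 500)
```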
Schemas
- A schema defines the column names and types of a DataFrame.
- It can be defined manually or read from the data source (schema-on-read); both are sketched below.
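A hedged sketch of both approaches; the file path and column names are assumptions for illustration, not from the original notes.

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Schema-on-read: let Spark infer the schema from the source.
inferred_df = spark.read.json("/data/flights.json")   # hypothetical path
inferred_df.printSchema()

# Or define the schema manually and enforce it while reading.
manual_schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), False),
])
manual_df = spark.read.schema(manual_schema).json("/data/flights.json")
```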
Overview of Structured Spark Types
- Spark is effectively a programming language of its own: the Catalyst engine maintains its own type information through the planning and processing of work.
- Even when we use a dynamically typed language, most of these operations still operate strictly on Spark types, not Python types (see the example below).
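For example, the addition below is written in Python, but it is planned and executed entirely on Spark's own types inside Catalyst rather than as Python arithmetic; the column name `number` is just an assumed example.

```python
df = spark.range(500).toDF("number")

# "+ 10" becomes a Spark column expression, not a Python integer addition
# performed on the driver.
df.select(df["number"] + 10).show(5)
```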
DataFrames vs Datasets
Within the Structured APIs, there are two further APIs:
- "untyped" DataFrames: a slightly inaccurate label, because they do have types, but the types are maintained by Spark, which only checks whether they line up with those declared in the schema at runtime.
- "typed" Datasets: Spark checks whether types conform to the specification at compile time. Only available in JVM-based languages (Scala and Java).
Usage in Languages
- To Spark (in Scala), DataFrames are simply Datasets of type `Row`. Using this specialized, optimized internal format rather than JVM types avoids the high costs of garbage collection and object instantiation.
- To Spark (in Python or R), there is no such thing as a Dataset; everything is a DataFrame, and therefore we always operate on that optimized format.
- NOTE: The `Row` type is Spark's internal representation of its optimized in-memory format for computation.
Columns
- Columns represent a simple type (like an integer or a string), a complex type (like an array or a map), or a null value.
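A minimal sketch of a DataFrame whose columns cover a simple type, a complex type, and nulls; the data and column names are made up, and the schema is given as a DDL-style string.

```python
df = spark.createDataFrame(
    [(1, ["spark", "sql"], None)],
    "id INT, tags ARRAY<STRING>, note STRING",  # simple, complex, and nullable columns
)
df.printSchema()
```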
Rows
- A row is nothing more than a record of data. Each record in a DataFrame must be of type `Row`, e.g. `spark.range(2).collect()` returns a list of `Row` objects.
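A quick sketch of collecting and building `Row` objects by hand; the literal values are arbitrary.

```python
from pyspark.sql import Row

rows = spark.range(2).collect()   # [Row(id=0), Row(id=1)]
print(rows[0]["id"])              # 0

# Rows can also be constructed manually, with or without field names.
my_row = Row("Hello", None, 1, False)
print(my_row[0], my_row[2])       # Hello 1
```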
Spark Types
- How to declare a column to be of a specific type.
- Python types at times have certain requirements when mapped to Spark types; refer to http://bit.ly/2EdflXW and the table below.
Data type | Value type in Python | API to access or create a data type |
---|---|---|
ByteType | int or long. Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Ensure that numbers are within the range of -128 to 127. | ByteType() |
ShortType | int or long. Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Ensure that numbers are within the range of -32768 to 32767. | ShortType() |
IntegerType | int or long. Note: Python has a lenient definition of "integer." Numbers that are too large will be rejected by Spark SQL if you use the IntegerType(). It's best practice to use LongType. | IntegerType() |
LongType | long. Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Ensure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. | LongType() |
FloatType | float. Note: Numbers will be converted to 4-byte single-precision floating-point numbers at runtime. | FloatType() |
DoubleType | float | DoubleType() |
DecimalType | decimal.Decimal | DecimalType() |
StringType | string | StringType() |
BinaryType | bytearray | BinaryType() |
BooleanType | bool | BooleanType() |
TimestampType | datetime.datetime | TimestampType() |
DateType | datetime.date | DateType() |
ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is True. |
MapType | dict | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is True. |
StructType | list or tuple | StructType(fields). Note: fields is a list of StructFields. Also, fields with the same name are not allowed. |
StructField | The value type in Python of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is True. |
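A small sketch of using the type objects from the table above directly, e.g. for casting a column; the DataFrame and column names are assumptions.

```python
from pyspark.sql.functions import col
from pyspark.sql.types import ByteType, DoubleType, ArrayType, StringType

df = spark.range(5).toDF("number")

# Cast a column to specific Spark types.
casted = df.select(
    col("number").cast(ByteType()).alias("as_byte"),
    col("number").cast(DoubleType()).alias("as_double"),
)
casted.printSchema()

# Complex types are parameterized, e.g. an array of strings.
tags_type = ArrayType(StringType(), containsNull=True)
print(tags_type.simpleString())   # array<string>
```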
Overview of Structured API Executions
- Execution of a single Structured API query goes from user code to executed code through these steps:
- Write DataFrame/Dataset/SQL code.
- If the code is valid, Spark converts it to a Logical Plan.
- Spark transforms this Logical Plan into a Physical Plan, checking for optimizations along the way.
- Spark then executes this Physical Plan (RDD manipulations) on the cluster.
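One way to watch these steps on a real query is `explain`; with `extended=True`, PySpark prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan. The query below is just an assumed example.

```python
df = spark.range(1000).toDF("number")
query = df.where("number % 2 = 0").selectExpr("sum(number)")

# Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
query.explain(extended=True)
```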
Logical Planning
- Represents a set of abstract transformations that do not refer to executors or drivers; it is purely about converting the user's set of expressions into the most optimized version.
- Spark does this by first converting user code into an unresolved logical plan (even valid code still needs to be checked for unknown column references, etc.).
- Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer
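The analyzer's catalog has a user-facing counterpart in `spark.catalog`, which can be queried directly; what it returns depends entirely on which tables and views exist in your session, so treat this as a sketch.

```python
# Databases and tables the analyzer can resolve names against.
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())
```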
Physical Planning (Spark Plan)
- specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model
- Physical planning results in a series of RDDs and transformations. This result is why you might have heard Spark referred to as a compiler: it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
Execution
- Spark runs all of this code over RDDs
- Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.