Differences from Scala and Python API
Public naming convention
All exported methods and functions, with the exception of the compatibility reader and writer (read.delta and write.delta), use the dlt_ prefix followed by a snake case name. For example, the dlt equivalent of the upstream convertToDelta is dlt_convert_to_delta.
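For instance, converting an existing Parquet table in place might look roughly like this (a hypothetical call, shown only to illustrate the naming scheme; the exact arguments of dlt_convert_to_delta are assumed here):

# Upstream Python/Scala: DeltaTable.convertToDelta(spark, "parquet.`/tmp/events`")
# dlt: snake case name with a dlt_ prefix (arguments assumed for illustration)
dlt_convert_to_delta("parquet.`/tmp/events`")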
Argument order
The Python Delta Lake API provides a number of methods with an optional condition, followed by a mandatory set (a mapping between column names and update expressions). For example, DeltaMergeBuilder.whenMatchedUpdate can be called with either condition and set:
@overload
def whenMatchedUpdate(
    self, condition: OptionalExpressionOrColumn, set: ColumnMapping
) -> "DeltaMergeBuilder": ...
or only set, passed as a keyword argument:
@overload
def whenMatchedUpdate(
    self, *, set: ColumnMapping
) -> "DeltaMergeBuilder": ...
This follows the Scala information flow, where a merge operation is defined using interleaved operations on DeltaMergeBuilder (whenMatched, whenNotMatched) and DeltaMerge*ActionBuilder (i.e. insert, insertAll, update, updateAll, delete):
deltaTable.as("target")
  .merge(
    source.as("source"),
    "target.key = source.key")
  .whenMatched("source.value < target.value") // Returns DeltaMergeMatchedActionBuilder
  .update(Map(
    "value" -> expr("source.value")
  )) // Returns DeltaMergeBuilder
  .whenNotMatched("source.value < target.value") // Returns DeltaMergeNotMatchedActionBuilder
  .insert(Map(
    "key" -> expr("source.key"),
    "value" -> lit(0)
  )) // Returns DeltaMergeBuilder
  .execute()
so condition, if present, is always provided before set.
In contrast, dlt provides composite methods modeled after the Python API, but mandatory arguments are placed before the optional ones. So Python's
def whenMatchedUpdate(
    self,
    condition: OptionalExpressionOrColumn = None,
    set: OptionalColumnMapping = None,
) -> "DeltaMergeBuilder": ...
is mapped to
setGeneric(
  "dlt_when_matched_update",
  function(dmb, set, condition) {
    standardGeneric("dlt_when_matched_update")
  }
)
in dlt.
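For example, an update with both arguments places the mandatory mapping first (a schematic sketch; dmb stands for an existing merge builder, and the column names are invented):

# set (mandatory) comes first, condition (optional) follows
dmb %>%
  dlt_when_matched_update(
    set = list(value = "source.value"),
    condition = "source.value < target.value"
  )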
Return types
In the Scala and Python APIs, methods used for their side effects (DeltaMergeBuilder.execute, DeltaTable.update, DeltaTable.delete, DeltaTable.generate, etc.), with the exception of DeltaTableBuilder.execute, return Unit and None, respectively. Additionally, DeltaTable.vacuum returns an empty Spark DataFrame.
In contrast, dlt invisibly returns an applicable instance of DeltaTable (the target DeltaTable for merge operations, otherwise the DeltaTable on which the method has been called). As a result, it is possible to chain calls like these:
dlt_for_path("/tmp/target") %>%
  dlt_delete("id in (1, 3, 5)") %>%
  dlt_update(list(ind = "-ind"), "key = 'a'") %>%
  dlt_alias("target") %>%
  dlt_merge(alias(source, "source"), "source.id = target.id") %>%
  dlt_when_not_matched_insert_all() %>%
  dlt_execute() %>%
  dlt_show()
Please keep in mind that, while convenient to write, such code can be harder to reason about and to recover from in case of a coding error. Use with caution.
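If that is a concern, the invisible returns also support a step-by-step style, where intermediate results are bound to names instead of being piped (a sketch equivalent to the first steps of the chain above):

# Each call returns the DeltaTable invisibly, so intermediate results
# can be captured and inspected between steps
target <- dlt_for_path("/tmp/target")
target <- dlt_delete(target, "id in (1, 3, 5)")
target <- dlt_update(target, list(ind = "-ind"), "key = 'a'")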
DeltaTableBuilder semantics
Scala and Python DeltaTableBuilder use mutable state to build a table definition. As a result, partial table builder definitions are not suitable for reuse. For example, the following Scala code:
import org.apache.spark.sql.types._

val parentBuilder = DeltaTable
  .create(spark)
  .addColumn("id", "integer")
  .addColumns(StructType(Seq(
    StructField("key", StringType),
    StructField("value", DoubleType)
  )))

val firstChildBuilder = parentBuilder
  .addColumn("first", "decimal(10, 2)")
  .tableName("first_child_table")

val secondChildBuilder = parentBuilder
  .addColumn("second", "boolean")
  .tableName("second_child_table")

firstChildBuilder.execute()
secondChildBuilder.execute()
fails with
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table default.second_child_table already exists
In contrast, the dlt builder is free of side effects and can be easily reused:
parent_builder <- dlt_create() %>%
  dlt_add_column("id", "integer") %>%
  dlt_add_columns(structType("key string, value double"))

first_child_builder <- parent_builder %>%
  dlt_table_name("first_child_table") %>%
  dlt_add_column("first", "decimal(10, 2)")

second_child_builder <- parent_builder %>%
  dlt_table_name("second_child_table") %>%
  dlt_add_column("second", "boolean")

first_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "first", type = "DecimalType(10,2)", nullable = TRUE
second_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "second", type = "BooleanType", nullable = TRUE