Differences from Scala and Python API
Source: vignettes/differences-from-scala-and-python-api.Rmd

Public naming convention
All exported methods and functions, with the exception of the compatibility
reader and writer (read.delta and
write.delta), use the dlt_ prefix followed by a snake case name.
For example, the dlt equivalent of the upstream
convertToDelta is dlt_convert_to_delta.
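For example, both naming styles side by side (a minimal sketch; the paths
are made up and the exact signature of dlt_convert_to_delta may differ):

# SparkR-style compatibility reader and writer
df <- read.delta("/tmp/events")
write.delta(df, "/tmp/events-copy")

# dlt_ prefix with snake case, the equivalent of the
# upstream convertToDelta (identifier argument assumed)
dlt_convert_to_delta("parquet.`/tmp/events-parquet`")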
Argument order
The Python Delta Lake API provides a number of methods with an optional
condition, followed by a mandatory set (a mapping
between column names and update expressions). For example,
DeltaMergeBuilder.whenMatchedUpdate can be called with
either condition and set:
@overload
def whenMatchedUpdate(
    self, condition: OptionalExpressionOrColumn, set: ColumnMapping
) -> "DeltaMergeBuilder": ...

or with only set, passed as a keyword argument:
@overload
def whenMatchedUpdate(
    self, *, set: ColumnMapping
) -> "DeltaMergeBuilder": ...

This follows the Scala information flow, where a
merge operation is defined using interleaved operations on
DeltaMergeBuilder (whenMatched,
whenNotMatched) and DeltaMerge*ActionBuilder
(i.e. insert, insertAll, update,
updateAll, delete):
deltaTable
  .as("target")
  .merge(
    source.as("source"),
    "target.key = source.key")
  .whenMatched("source.value < target.value") // Returns DeltaMergeMatchedActionBuilder
  .update(Map(
    "value" -> expr("source.value")
  )) // Returns DeltaMergeBuilder
  .whenNotMatched("source.value > 0") // Returns DeltaMergeNotMatchedActionBuilder
  .insert(Map(
    "key" -> expr("source.key"),
    "value" -> lit(0)
  )) // Returns DeltaMergeBuilder
  .execute()

so condition, if present, is always provided before
set.
In contrast, dlt provides composite methods modeled
after the Python API, but with mandatory arguments placed before the optional
ones. So Python's
def whenMatchedUpdate(
    self,
    condition: OptionalExpressionOrColumn = None,
    set: OptionalColumnMapping = None
) -> "DeltaMergeBuilder": ...

is mapped to
setGeneric(
  "dlt_when_matched_update",
  function(dmb, set, condition) {
    standardGeneric("dlt_when_matched_update")
  }
)

in dlt.
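For example, given a DeltaMergeBuilder dmb (a hypothetical object created
with dlt_merge), the mandatory set comes first and the optional
condition, when used, follows it:

# set only
dmb %>%
  dlt_when_matched_update(list(value = "source.value"))

# set followed by the optional condition
dmb %>%
  dlt_when_matched_update(
    list(value = "source.value"),
    "source.value < target.value"
  )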
Return types
In the Scala and Python APIs, methods used for their side effects
(DeltaMergeBuilder.execute, DeltaTable.update,
DeltaTable.delete, DeltaTable.generate, etc.),
with the exception of DeltaTableBuilder.execute,
return Unit or None respectively.
Additionally, DeltaTable.vacuum returns an empty Spark
DataFrame.
In contrast, dlt invisibly returns an applicable
instance of DeltaTable (the target
DeltaTable for merge operations, otherwise the DeltaTable
on which the method has been called).
As a result, it is possible to chain calls like the following:

dlt_for_path("/tmp/target") %>%
  dlt_delete("id in (1, 3, 5)") %>%
  dlt_update(list(ind = "-ind"), "key = 'a'") %>%
  dlt_alias("target") %>%
  dlt_merge(alias(source, "source"), "source.id = target.id") %>%
  dlt_when_not_matched_insert_all() %>%
  dlt_execute() %>%
  dlt_show()

Please keep in mind that, while convenient to write, such code can be harder
to reason about and to recover from in case of a coding error. Use with
caution.
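Because the value is returned invisibly, nothing is printed unless the
result is captured or piped further. A short sketch (the path is made up):

tbl <- dlt_for_path("/tmp/target") %>%
  dlt_delete("id in (1, 3, 5)")  # returns the DeltaTable invisibly, prints nothing

tbl %>%  # the same DeltaTable, usable for further operations
  dlt_show()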
DeltaTableBuilder semantics
The Scala and Python DeltaTableBuilder use mutable state
to build the table definition. As a result, partial table builder
definitions are not suitable for reuse. For example, the following Scala
code:
import org.apache.spark.sql.types._

val parentBuilder = DeltaTable
  .create(spark)
  .addColumn("id", "integer")
  .addColumns(StructType(Seq(
    StructField("key", StringType), StructField("value", DoubleType)
  )))

val firstChildBuilder = parentBuilder
  .addColumn("first", "decimal(10, 2)")
  .tableName("first_child_table")

val secondChildBuilder = parentBuilder
  .addColumn("second", "boolean")
  .tableName("second_child_table")

firstChildBuilder.execute()
secondChildBuilder.execute()

fails with
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table default.second_child_table already exists
In contrast, the dlt builder is free of side effects and
can be easily reused:
parent_builder <- dlt_create() %>%
  dlt_add_column("id", "integer") %>%
  dlt_add_columns(structType("key string, value double"))

first_child_builder <- parent_builder %>%
  dlt_table_name("first_child_table") %>%
  dlt_add_column("first", "decimal(10, 2)")

second_child_builder <- parent_builder %>%
  dlt_table_name("second_child_table") %>%
  dlt_add_column("second", "boolean")
first_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "first", type = "DecimalType(10,2)", nullable = TRUE
second_child_builder %>%
  dlt_execute() %>%
  dlt_to_df() %>%
  schema()
# StructType
# |-name = "id", type = "IntegerType", nullable = TRUE
# |-name = "key", type = "StringType", nullable = TRUE
# |-name = "value", type = "DoubleType", nullable = TRUE
# |-name = "second", type = "BooleanType", nullable = TRUE