Databricks upgrades platform with Spark 4.0
Databricks has released Runtime 17.3 LTS, which is powered by Apache Spark 4.0.0. The update brings performance improvements and enhanced REST API endpoints for Unity Catalog and Iceberg tables, and it makes Scala 2.13 the default Scala version, which affects long-term migration plans for data engineering teams.
- The upgrade to Spark 4.0 introduces a native `VARIANT` data type, designed for efficiently storing and querying semi-structured data like JSON without costly parsing. This can lead to up to 8x faster query execution on nested fields, a significant improvement for handling complex policy or claims data.
- For MLOps, Spark Connect is now generally available and supports the full MLlib API, allowing machine learning models to be trained remotely from a lightweight Python client (~1.5 MB). This architecture separates the client application from the Spark cluster, enabling more flexible and scalable ML pipeline deployments.
- Python functionality has been expanded with a new Data Source API, which allows developers to create custom connectors entirely in Python, removing the previous need for Java or Scala knowledge. Additionally, a unified profiling system for PySpark User-Defined Functions (UDFs) has been introduced to help identify performance and memory bottlenecks.
- The default switch to ANSI SQL mode enforces stricter data quality rules, throwing runtime exceptions for invalid operations rather than returning nulls. This aligns Spark's behavior with traditional databases and can improve data integrity for risk modeling and actuarial analysis.
- Under the hood, Scala 2.13 brings a completely redesigned collections library and a 5-10% improvement in compiler performance over Scala 2.12. For teams with existing Scala codebases, this change requires careful migration planning because of potential incompatibilities in collections handling.
- Structured Streaming receives a new `transformWithState` operator for more flexible and efficient stateful stream processing, along with the ability to directly query a stream's internal state as a table for improved debugging and observability.
- While Spark 4.0 brings performance gains through features like enhanced Adaptive Query Execution, Databricks further accelerates workloads with its proprietary Photon engine. Photon is a C++-based vectorized execution engine that runs transparently with Spark APIs and has been shown to provide average speedups of 3x-8x for SQL and DataFrame workloads.
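
To make the ANSI SQL mode change above more concrete, here is a minimal plain-Python sketch of the two behaviors: returning a null for an invalid cast versus raising an error. It does not call Spark. The helper `cast_to_int`, its `ansi` flag, and the sample `claims` values are hypothetical names invented for illustration, and the real Spark error types and messages will differ.

```python
from typing import Optional


def cast_to_int(value: str, ansi: bool) -> Optional[int]:
    """Emulate a string-to-integer cast under two error policies.

    Hypothetical helper for illustration only; this is not a Spark API.
    """
    try:
        return int(value.strip())
    except ValueError:
        if ansi:
            # Strict policy: surface the bad value immediately.
            raise ValueError(f"cannot cast {value!r} to int") from None
        # Lenient policy: swallow the error and yield a null.
        return None


# Made-up claim amounts; the middle entry is malformed.
claims = ["1200", "n/a", "450"]

# Lenient mode silently turns the malformed amount into None.
lenient = [cast_to_int(c, ansi=False) for c in claims]

# Strict mode stops at the first malformed amount.
try:
    strict = [cast_to_int(c, ansi=True) for c in claims]
except ValueError as err:
    strict_error = str(err)
```

In the lenient case the bad record survives as a null and can quietly skew a downstream aggregate. In the strict case the pipeline fails early, which is the trade-off teams should expect when existing jobs first run with ANSI mode enabled.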