This document provides an overview of the new interval data types introduced in Apache Spark 3.2 to support ANSI SQL INTERVAL types. It describes the limitations of the previous CalendarIntervalType and introduces the new YearMonthIntervalType and DayTimeIntervalType. Key points covered include the properties and capabilities of the new types, how to create and manipulate interval columns, and compatibility with external Java types and SQL standards. The document also discusses milestone achievements and ongoing work to improve interval type support in Spark.
3. Agenda
▪ Overview of new interval types in Spark 3.2
▪ Limitations of CalendarIntervalType
▪ Year-Month Interval
▪ Day-Time Interval
4. SPARK-27790
Support ANSI SQL INTERVAL types
• Spark SQL 3.2 releases two new Catalyst types: the year-month interval and the day-time interval type.
• CalendarIntervalType is no longer recommended and will be deprecated.
11. New Catalyst types in Apache Spark 3.2
• YearMonthIntervalType
▪ Precision: months
▪ Comparable and orderable
▪ Value size: 4 bytes
▪ Minimum value: INTERVAL '-178956970-8' YEAR TO MONTH
▪ Maximum value: INTERVAL '178956970-7' YEAR TO MONTH
• DayTimeIntervalType
▪ Precision: microseconds
▪ Comparable and orderable
▪ Value size: 8 bytes
▪ Minimum value: INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND
▪ Maximum value: INTERVAL '106751991 04:00:54.775807' DAY TO SECOND
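These bounds follow directly from the physical representation: a year-month interval is a 4-byte signed count of months, and a day-time interval is an 8-byte signed count of microseconds. A quick sketch of the arithmetic (plain Scala, no Spark needed):

```scala
// Year-month bound: Int.MaxValue months split into years and months.
val yearsMax  = Int.MaxValue / 12     // 178956970
val monthsMax = Int.MaxValue % 12     // 7  -> INTERVAL '178956970-7' YEAR TO MONTH
val yearsMin  = Int.MinValue / 12     // -178956970
val monthsMin = Int.MinValue % 12     // -8 -> INTERVAL '-178956970-8' YEAR TO MONTH

// Day-time bound: Long.MaxValue microseconds split into days and a remainder.
val microsPerDay = 86400L * 1000000L
val daysMax      = Long.MaxValue / microsPerDay  // 106751991
val restMicros   = Long.MaxValue % microsPerDay  // the 04:00:54.775807 part
```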
12. Creation of interval columns
▪ From interval strings:
Interval literals:
INTERVAL '1-1' YEAR TO MONTH
INTERVAL '1 02:03:04' DAY TO SECOND
Casting strings to interval types:
$"col".cast(YearMonthIntervalType)
▪ From external types, by parallelizing collections of java.time.Period and java.time.Duration:
Seq(Period.ofDays(10)).toDS
Seq(Duration.ofDays(10)).toDS
▪ From integral fields, via the function-constructor of interval types:
make_interval(1, 2)
make_interval(1, 2, 3, 4, 5.123)
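The external types above map onto plain java.time values: a year-month interval is backed by the years/months fields of a Period, and a day-time interval by a Duration. A sketch of the values these constructors produce (java.time only, no Spark session; the field choices here are illustrative):

```scala
import java.time.{Duration, Period}

// Year-month side: only years and months matter to Spark.
val ym = Period.of(1, 2, 0)                  // 1 year 2 months
val totalMonths = ym.toTotalMonths           // 14

// Day-time side: Duration carries the sub-day fields.
val dt = Duration.ofDays(1).plusHours(2)
  .plusMinutes(3).plusSeconds(4)             // 1 day 02:03:04
val totalSeconds = dt.getSeconds             // 93784
```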
13. Operations involving datetimes and intervals
Arithmetic operations involving values of type datetime or interval obey the natural rules associated
with dates and times and yield valid datetime or interval results according to the Gregorian calendar.
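Since Spark 3.x uses the Proleptic Gregorian calendar, the same rules that java.time applies to LocalDate are a reasonable sketch of this arithmetic (illustrative only, not Spark's internal code path):

```scala
import java.time.{LocalDate, Period}

// Adding a year-month interval to a date follows Gregorian rules,
// clamping to the last valid day of the resulting month.
val d       = LocalDate.of(2021, 1, 31)
val shifted = d.plus(Period.ofMonths(1))   // 2021-02-28, not an invalid Feb 31
```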
15. date + day-time interval
[SPARK-35051][SQL] Support add/subtract of a day-time interval to/from a date
16. spark.sql.legacy.interval.enabled
• When set to true, Spark SQL uses the mixed legacy interval type CalendarIntervalType instead of the ANSI-compliant interval types YearMonthIntervalType and DayTimeIntervalType.
• It affects:
• Date and timestamp subtraction
• Parsing of ANSI interval literals:
INTERVAL '1 02:03:04' DAY TO SECOND
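An illustrative session (the exact result type shown depends on the flag; queries are a sketch, not output from a specific build):

```sql
-- Default (ANSI) behaviour: date subtraction yields a day-time interval value.
SET spark.sql.legacy.interval.enabled = false;
SELECT DATE'2021-03-01' - DATE'2021-02-01';

-- Legacy behaviour: the same subtraction yields a CalendarIntervalType value.
SET spark.sql.legacy.interval.enabled = true;
SELECT DATE'2021-03-01' - DATE'2021-02-01';
```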
18. External Java types
▪ java.time.Period models a quantity or amount of time in terms of years, months and days. Spark takes the years and months fields only.
▪ java.time.Duration models a quantity or amount of time in terms of seconds and nanoseconds. Spark casts the nanoseconds to microseconds.
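The nanosecond-to-microsecond narrowing can be sketched as simple truncation (an assumption for illustration; Spark's exact conversion lives in its internal interval utilities):

```scala
import java.time.Duration

val d = Duration.ofSeconds(1, 234567)  // 1 second + 234,567 nanoseconds
// Narrow to microsecond precision by dropping sub-microsecond nanos.
val micros = d.toNanos / 1000          // 1,000,234 microseconds
```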
20. Specification of interval types in schemas
• Year-Month Interval Type:
CREATE TABLE tbl (
  id INT,
  delay INTERVAL YEAR TO MONTH
)
• Day-Time Interval Type:
CREATE TABLE tbl (
  len INT,
  tout INTERVAL DAY TO SECOND
)
21. SPARK-27790: Support ANSI SQL INTERVAL types:
Milestone 1 – Spark interval equivalency (the new interval types meet or exceed all functionality of the existing SQL interval)
Milestone 2 – Persistence:
Ability to create tables of type interval
Ability to write to common file formats such as Parquet and JSON
INSERT, SELECT, UPDATE, MERGE
Discovery
Milestone 3 – Client support:
JDBC support
Hive Thrift server
Milestone 4 – PySpark and SparkR integration:
Python UDFs can take and return intervals
DataFrame support