A Complete Guide to Slowly Changing Dimensions with Databricks Delta Tables

In data warehousing, Slowly Changing Dimensions (SCD) are essential for accurately tracking and managing changes in data over time. This guide provides an in-depth exploration of SCD types 0 through 4 and their practical implementation in modern data platforms, with a special focus on Databricks Delta Tables. By marrying traditional data warehousing techniques with advanced data platform features, we aim to offer a holistic view for data professionals navigating this complex landscape.

Exploring Slowly Changing Dimensions (SCD)

Slowly Changing Dimensions are methodologies employed in data warehousing to manage and track changes in dimension data, such as customer details or product information, over time.

SCD Type 0 — Fixed Dimension

  • Characteristics: Data remains constant post-loading.
  • Use Case: Ideal for static, historical data that should not be altered.

SCD Type 1 — Current Data Only

  • Characteristics: Stores only the latest data, overwriting previous information.

  • Use Case: Best suited for cases where only the most recent data is necessary, and historical tracking is not required.

SCD Type 2 — Historical Tracking

  • Characteristics: Creates new records for each change, maintaining a comprehensive history.

  • Use Case: Widely used for detailed historical analysis and when preserving the timeline of data changes is crucial.

SCD Type 3 — Previous Value Field

  • Characteristics: Tracks changes using additional columns for previous data.

  • Use Case: Effective for cases where tracking a limited history (like the immediate last change) is sufficient.

SCD Type 4 — History Table

  • Characteristics: Utilizes separate tables for current and historical data.

  • Use Case: Optimizes performance in scenarios where historical data is accessed less frequently but needs to be preserved.

Databricks Delta Tables: Enhancing SCD Implementation

Databricks Delta Tables offer a robust foundation for implementing SCDs: ACID transactions keep multi-row changes consistent, *MERGE* handles inserts and updates in a single statement, and the change data feed records row-level changes.

Implementing Various SCD Types in Delta Tables

Let’s say we need to track changes in product pricing over time.

1. SCD Type 0: Ensures data immutability, essential for maintaining static historical data. In this configuration, we make sure that data is never changed or overwritten once it is loaded: the price never changes.

Inflation has no say here!

CREATE TABLE product_pricing (
  product_id INT,
  price DOUBLE,
  effective_date DATE
) USING DELTA;

INSERT INTO product_pricing (product_id, price, effective_date)
VALUES
  (1, 100.00, '2023-01-01'),
  (2, 200.00, '2023-01-01');
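Nothing in the table definition above actually prevents a later UPDATE. One way to enforce Type 0 behavior is Delta's append-only table property. The following is a minimal sketch; the table name product_pricing_fixed is an assumption, used so that the later examples can still modify product_pricing.

```sql
-- Hypothetical Type 0 table (the name is not from the article)
CREATE TABLE product_pricing_fixed (
  product_id INT,
  price DOUBLE,
  effective_date DATE
) USING DELTA
TBLPROPERTIES ('delta.appendOnly' = 'true');

-- Appending new rows is still allowed
INSERT INTO product_pricing_fixed (product_id, price, effective_date)
VALUES (1, 100.00, DATE'2023-01-01');

-- With delta.appendOnly set, updates and deletes are expected to be rejected:
-- UPDATE product_pricing_fixed SET price = 150.00 WHERE product_id = 1;
```

The property blocks updates and deletes on existing rows, so it only fits tables that are truly write-once.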

2. SCD Type 1: We update the record in place, and no history is kept.

UPDATE product_pricing SET price = 150.00 WHERE product_id = 1;

SCD Type 1 is easy to implement with Delta Tables. *UPDATE* operations keep the data consistent, but we lose the history of the changes that occurred before.
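When new prices arrive as a batch rather than one at a time, the same Type 1 behavior can be expressed as an upsert. This is a sketch under assumptions: the article refers to an updated_pricing source but never defines it, so the schema and sample rows below are hypothetical.

```sql
-- Hypothetical staging table holding the latest known price per product
CREATE TABLE IF NOT EXISTS updated_pricing (
  product_id INT,
  price DOUBLE
) USING DELTA;

INSERT INTO updated_pricing VALUES (1, 150.00), (3, 300.00);

-- Type 1 upsert: overwrite the price of known products, insert unknown ones
MERGE INTO product_pricing AS target
USING updated_pricing AS source
ON target.product_id = source.product_id
WHEN MATCHED AND target.price <> source.price THEN
  UPDATE SET target.price = source.price
WHEN NOT MATCHED THEN
  INSERT (product_id, price, effective_date)
  VALUES (source.product_id, source.price, current_date());
```

The extra condition on the matched clause skips rows whose price is unchanged, which avoids rewriting data for no reason.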

3. SCD Type 2: With SCD Type 2, we add columns such as start_date and end_date to track the price over time. When a new product is added, its end_date is null and its price is active. When an existing product's price changes, we archive the current row by marking it inactive and setting its end date from null to the current date.

ALTER TABLE product_pricing ADD COLUMNS (
  is_price_active BOOLEAN,
  start_date DATE,
  end_date DATE
);

-- Rows loaded before these columns existed hold NULL in them; mark them as the active versions
UPDATE product_pricing
SET is_price_active = true, start_date = effective_date
WHERE is_price_active IS NULL;

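-- A changed product appears twice in the source: keyed by product_id (closes the
-- active row) and with a NULL key (never matches, so the new version is inserted)
MERGE INTO product_pricing AS target
USING (
  SELECT product_id AS merge_key, product_id, price FROM updated_pricing
  UNION ALL
  SELECT NULL AS merge_key, s.product_id, s.price
  FROM updated_pricing AS s
  JOIN product_pricing AS t ON s.product_id = t.product_id
  WHERE t.is_price_active = true AND s.price <> t.price
) AS source
ON target.product_id = source.merge_key AND target.is_price_active = true
WHEN MATCHED AND target.price <> source.price THEN
  UPDATE SET target.is_price_active = false, target.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (product_id, price, start_date, is_price_active, end_date)
  VALUES (source.product_id, source.price, current_date(), true, NULL);

The *MERGE* operation in Delta Tables handles both cases in one atomic statement: it closes the active row of a product whose price changed and inserts the new version, and it inserts products seen for the first time. Rows whose price is unchanged are left alone, which makes it well suited to maintaining a detailed change history.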
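Once the table carries start_date, end_date and is_price_active, reading it back is mostly a matter of filtering. Here is a minimal sketch of the two most common lookups; the date literal is an arbitrary example.

```sql
-- Current price of every product: only the active version of each row
SELECT product_id, price, start_date
FROM product_pricing
WHERE is_price_active = true;

-- Price that applied on a given day
SELECT product_id, price
FROM product_pricing
WHERE start_date <= DATE'2023-06-01'
  AND (end_date IS NULL OR end_date > DATE'2023-06-01');
```

The second query treats each row as valid from start_date up to, but not including, end_date. On the day of a change this returns the new price rather than both versions.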

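4. SCD Type 3: Recording the current and previous price in the same record. This assumes the table has current_price and previous_price columns instead of a single price column.

MERGE INTO product_pricing AS target
USING updated_pricing AS source
ON target.product_id = source.product_id
-- Only shift the prices when the incoming price actually differs
WHEN MATCHED AND target.current_price <> source.price THEN
  UPDATE SET target.previous_price = target.current_price, target.current_price = source.price
WHEN NOT MATCHED THEN
  INSERT (product_id, previous_price, current_price)
  VALUES (source.product_id, NULL, source.price);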

This type uses a custom field to track specific previous values. But what if the product changed price twice? Using SCD type 3, we can only keep track of the current and previous price. To handle this case, we can use SCD type 4.
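The Type 3 MERGE above relies on current_price and previous_price columns that none of the earlier statements create. As a sketch, here is a dedicated table with that shape. The name product_pricing_scd3 is an assumption, and the MERGE would need to target it instead of product_pricing.

```sql
-- Hypothetical Type 3 table: one row per product, two price columns
CREATE TABLE product_pricing_scd3 (
  product_id INT,
  current_price DOUBLE,
  previous_price DOUBLE
) USING DELTA;

-- A product that has never changed price has no previous value yet
INSERT INTO product_pricing_scd3 (product_id, current_price, previous_price)
VALUES (1, 100.00, NULL);

-- How far each price moved at its last change
SELECT product_id,
       current_price,
       previous_price,
       current_price - previous_price AS last_change
FROM product_pricing_scd3
WHERE previous_price IS NOT NULL;
```

Keeping the Type 3 columns in their own table avoids mixing them with the Type 2 validity columns added earlier.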

5. SCD Type 4 with Change Data Feed (CDF): Tracking product price modifications in a separate change log.

To do so, we either create the table with change data feed enabled or enable it on an existing table through the table property *delta.enableChangeDataFeed*. When it is enabled on an existing table, only future changes are tracked.

To read the change feed with *table_changes()*, you need at least one of the following:

  • SELECT privilege on the specified table

  • Ownership of the table

  • Administrative privileges
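For the first option in the permissions list above, the SELECT privilege can be granted with a standard GRANT statement. This is a minimal sketch; the group name data_analysts is a placeholder, and whether you may run it depends on how access control is set up in your workspace.

```sql
-- Let the principal read the table, and therefore its change feed
GRANT SELECT ON TABLE product_pricing TO `data_analysts`;

-- Review what has been granted on the table
SHOW GRANTS ON TABLE product_pricing;
```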

-- We can enable the CDF during table creation
CREATE TABLE product_pricing (product_id LONG, price DOUBLE)
TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- We can also do it on existing tables. In this case the CDF
-- will be enabled only for the table's future changes
ALTER TABLE product_pricing SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
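Before relying on the feed, it can help to confirm that the property is set and to see which commit versions exist, since those versions are what a change query takes as its range. A short sketch:

```sql
-- Confirm the property is set on the table
SHOW TBLPROPERTIES product_pricing ('delta.enableChangeDataFeed');

-- List the table's commits; the version column gives usable range boundaries
DESCRIBE HISTORY product_pricing;
```

On a table where CDF was enabled after creation, the version of the ALTER TABLE commit is the earliest one from which changes can be read.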

When changing price in a table where CDF is enabled, we can use *table_changes()* to observe historical changes.

-- Insert new products
INSERT INTO product_pricing (product_id, price) VALUES (1, 100.0), (2, 200.0), (3, 300.0);

-- Update the price of a product
UPDATE product_pricing SET price = 250.0 WHERE product_id = 2;

-- Delete a product
DELETE FROM product_pricing WHERE product_id = 3;

-- Select all changes on the table since CDF was enabled.
-- We can be more specific using the start and optional end arguments:
--   table_changes('product_pricing', start [, end])
-- Each one can be either a commit version or a change timestamp.
-- Here 0 is the starting commit version, assuming CDF was enabled when the table was created.
SELECT * FROM table_changes('product_pricing', 0);

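Running that query on the table returns one row per recorded change, together with metadata columns that identify the type of change and the commit it belongs to.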
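The change feed exposes its metadata through the _change_type, _commit_version and _commit_timestamp columns. An update produces two rows, a pre-image and a post-image. The following sketch keeps only the rows needed to read a price history; the starting version 0 is an assumption and should be the version at which CDF was enabled.

```sql
-- Price history per product, skipping the "before" side of each update
SELECT product_id,
       price,
       _change_type,
       _commit_version,
       _commit_timestamp
FROM table_changes('product_pricing', 0)
WHERE _change_type <> 'update_preimage'
ORDER BY _commit_version, product_id;
```

The remaining change types are insert, update_postimage and delete.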

CDF provides a powerful tool for real-time change tracking, offering a detailed audit trail for data changes. All operations can be tracked easily.
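To end up with the separate history table that SCD Type 4 describes, the change rows can be copied out of the feed. This is a sketch under assumptions: the table product_pricing_history and its column names are hypothetical, and 0 again stands for the version at which CDF was enabled.

```sql
-- Hypothetical history table for the Type 4 pattern
CREATE TABLE IF NOT EXISTS product_pricing_history (
  product_id BIGINT,
  price DOUBLE,
  change_type STRING,
  commit_version BIGINT,
  commit_timestamp TIMESTAMP
) USING DELTA;

-- Copy the change rows into the history table
INSERT INTO product_pricing_history
SELECT product_id, price, _change_type, _commit_version, _commit_timestamp
FROM table_changes('product_pricing', 0)
WHERE _change_type IN ('insert', 'update_postimage', 'delete');
```

Running the INSERT twice with the same starting version would duplicate rows. A recurring job would need to remember the last commit_version it copied and start from the next one.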

Conclusion

The concept of Slowly Changing Dimensions is a cornerstone in the field of data warehousing, offering nuanced strategies to manage evolving data. The integration of these methodologies with Databricks Delta Tables represents a significant advance in data management. Delta Tables not only simplify the implementation of traditional SCD types but also enrich them with modern features like ACID transactions and real-time processing. The addition of Change Data Feed further enhances Delta Tables, making them a formidable tool for contemporary data engineering challenges. This blend of traditional and modern techniques exemplifies the evolving landscape of data management, providing comprehensive solutions for today’s data-driven decision-making processes.

References

table_changes table-valued function
*Learn the syntax of the table_changes function of the SQL language in Databricks SQL and Databricks Runtime.*
docs.databricks.com

Use Delta Lake change data feed on Databricks
*Learn how to get row-level change information from Delta tables using the Delta Lake change data feed.*
docs.databricks.com

https://docs.delta.io/0.4.0/delta-update.html
