SQL Querying & Tuning for Large Datasets

Introduction: SQL in the Big Data World

SQL remains the lingua franca for data analysis, even with massive datasets. However, querying petabytes of data efficiently requires a deeper understanding of how SQL engines (such as Spark SQL on Databricks) process queries and how to optimize them. This section covers advanced SQL features and tuning strategies for big data.

Window Functions: Powerful Analytical SQL

Window functions perform calculations across a set of table rows related to the current row. Unlike aggregate functions (which collapse a group into a single value), window functions return a value for each row in the window.
Syntax:

```sql
FUNCTION(column) OVER (
    [PARTITION BY column1 [, column2, ...]]
    [ORDER BY column3 [ASC|DESC] [, column4 [ASC|DESC], ...]]
    [ROWS BETWEEN start AND end | RANGE BETWEEN start AND end]
)
```
- `PARTITION BY`: Divides the rows into groups (partitions) within which the window function operates independently. Similar to `GROUP BY`, but does not collapse rows.
- `ORDER BY`: Orders the rows within each partition. Essential for rank-based or sequential calculations.
- Window frame (`ROWS BETWEEN` / `RANGE BETWEEN`): Defines the set of rows within the partition that are included in the calculation for the current row.
  - `UNBOUNDED PRECEDING`: from the start of the partition.
  - `CURRENT ROW`: the current row.
  - `N PRECEDING` / `N FOLLOWING`: N rows before/after the current row.
  - `UNBOUNDED FOLLOWING`: to the end of the partition.
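The frame clauses above can be seen in action with a small runnable sketch. SQLite (3.25+, bundled with Python's `sqlite3`) is used as a stand-in for Spark SQL since the frame semantics are the same; the `daily_sales` table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical daily_sales table; SQLite stands in for Spark SQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (d INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

rows = conn.execute("""
    SELECT d, amount,
           SUM(amount) OVER (
               ORDER BY d
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS rolling_2day
    FROM daily_sales
""").fetchall()
# Each row sums itself plus at most one preceding row:
# [(1, 10, 10), (2, 20, 30), (3, 30, 50), (4, 40, 70)]
```

The first row has no preceding row, so its frame contains only itself; every later frame holds exactly two rows.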
Common Window Functions & Examples:
Ranking Functions:
- `ROW_NUMBER()`: Assigns a unique, sequential integer to each row within its partition, starting from 1.
  `SELECT product_id, sale_date, sales_amount, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_date DESC) AS rn FROM sales_data;`
- `RANK()`: Assigns a rank within its partition. Tied values receive the same rank, and the next rank is skipped.
  `SELECT department, employee_name, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank FROM employees;`
- `DENSE_RANK()`: Similar to `RANK()`: tied values receive the same rank, but no ranks are skipped.
  `SELECT category, item_name, price, DENSE_RANK() OVER (PARTITION BY category ORDER BY price ASC) AS price_dense_rank FROM products;`
- `NTILE(n)`: Divides the rows in each partition into n groups and assigns a group number (from 1 to n) to each row.
  `SELECT customer_id, order_total, NTILE(4) OVER (ORDER BY order_total DESC) AS quartile FROM orders;`
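The difference between `RANK()` and `DENSE_RANK()` on ties is easiest to see side by side. A minimal sketch, again using SQLite's `sqlite3` as a stand-in, with a made-up `employees` table containing one tied salary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
# "b" and "c" are tied on salary.
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 300), ("b", 200), ("c", 200), ("d", 100)])

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
    FROM employees
""").fetchall()
ranks = {name: (rnk, drnk) for name, rnk, drnk in rows}
# RANK skips rank 3 after the tie; DENSE_RANK does not:
# {"a": (1, 1), "b": (2, 2), "c": (2, 2), "d": (4, 3)}
```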
Analytic Functions:
- `LAG(column, offset, default)`: Accesses data from a previous row in the same result set.
  `SELECT transaction_date, amount, LAG(amount, 1, 0) OVER (ORDER BY transaction_date) AS previous_day_amount FROM daily_transactions;`
- `LEAD(column, offset, default)`: Accesses data from a subsequent row in the same result set.
  `SELECT transaction_date, amount, LEAD(amount, 1, 0) OVER (ORDER BY transaction_date) AS next_day_amount FROM daily_transactions;`
- `FIRST_VALUE(column)` / `LAST_VALUE(column)`: Returns the value of the specified column from the first/last row of the window frame.
  `SELECT employee_name, department, hire_date, FIRST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY hire_date) AS first_hire_in_dept FROM employees;`
- Aggregate functions as window functions (e.g., `SUM()`, `AVG()`, `COUNT()`): Perform aggregations over a window. Note that `ROWS BETWEEN 3 PRECEDING AND CURRENT ROW` covers four rows: the current row plus the three before it.
  `SELECT sale_date, region, sales_amount, SUM(sales_amount) OVER (PARTITION BY region ORDER BY sale_date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_4day_sum FROM regional_sales;`
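A common use of `LAG` is computing a day-over-day delta. A runnable sketch with SQLite as the stand-in engine and a made-up `daily_transactions` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_transactions (d INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO daily_transactions VALUES (?, ?)",
                 [(1, 100), (2, 150), (3, 130)])

rows = conn.execute("""
    SELECT d, amount,
           amount - LAG(amount, 1, 0) OVER (ORDER BY d) AS delta
    FROM daily_transactions
""").fetchall()
# The first row has no predecessor, so LAG falls back to the default 0:
# [(1, 100, 100), (2, 150, 50), (3, 130, -20)]
```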
Advanced Aggregations: CUBE, ROLLUP, and GROUPING SETS
These extensions to the GROUP BY clause allow you to generate multiple levels of aggregation in a single query, which is useful for reporting and OLAP-style analysis.

- `ROLLUP(a, b)`: Creates subtotals that "roll up" from the most detailed level to a grand total. It generates grouping sets for `(a, b)`, `(a)`, and `()`.
- `CUBE(a, b)`: Creates subtotals for all possible combinations of the given columns. It generates grouping sets for `(a, b)`, `(a)`, `(b)`, and `()`.
- `GROUPING SETS((a, b), (a), (b), ())`: Allows you to explicitly specify the exact combinations (grouping sets) you want to aggregate.
```sql
SELECT
    region,
    department,
    SUM(sales) AS total_sales
FROM
    sales_data
GROUP BY
    GROUPING SETS ((region, department), (region), ());
```
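Engines without `GROUPING SETS` support (SQLite among them) can emulate the same result with one `UNION ALL` branch per grouping set, with `NULL` marking the "all values" level just as `GROUPING SETS` does. A sketch with made-up `sales_data` rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (region TEXT, department TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales_data VALUES (?, ?, ?)", [
    ("east", "toys", 10), ("east", "books", 20), ("west", "toys", 5),
])

# One branch per grouping set: (region, department), (region), ().
rows = conn.execute("""
    SELECT region, department, SUM(sales) AS total_sales
      FROM sales_data GROUP BY region, department
    UNION ALL
    SELECT region, NULL, SUM(sales) FROM sales_data GROUP BY region
    UNION ALL
    SELECT NULL, NULL, SUM(sales) FROM sales_data
""").fetchall()
# 3 detail rows + 2 region subtotals + 1 grand total = 6 rows.
```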
SQL Query Tuning for Large Datasets
Optimizing SQL queries on distributed systems like Spark (Databricks) is critical for performance and cost efficiency. It often involves reducing data processed, optimizing shuffles, and leveraging underlying data formats.
1. Understand the Query Execution Plan:
- Use `EXPLAIN EXTENDED` or `EXPLAIN FORMATTED` (in Databricks/Spark SQL) to see the logical and physical plans. Look for expensive operations like full table scans, large shuffles, and broadcast joins.
```sql
EXPLAIN EXTENDED
SELECT count(*) FROM large_table WHERE date_col = '2023-01-01';
```
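The habit of checking plans transfers to any engine. Spark's `EXPLAIN` output is not runnable here, but SQLite's `EXPLAIN QUERY PLAN` makes the same point locally: the plan reveals whether the engine scans everything or can skip data (an index here plays the role partition pruning plays in Spark). Table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (date_col TEXT, v INTEGER)")

# Without an index the planner scans the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM large_table "
    "WHERE date_col = '2023-01-01'").fetchall()
scan_detail = plan[0][3]    # e.g. "SCAN large_table"

# After indexing the filter column, it can seek instead of scanning.
conn.execute("CREATE INDEX idx_date ON large_table (date_col)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM large_table "
    "WHERE date_col = '2023-01-01'").fetchall()
index_detail = plan[0][3]   # mentions idx_date
```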
2. Data Filtering and Projection:
- Filter early, filter often: Apply `WHERE` clauses as early as possible to reduce the amount of data processed in subsequent steps.
- Select only necessary columns: Avoid `SELECT *`; project only the columns required for your query. This reduces I/O and memory usage.
3. Optimize Joins:
- Broadcast hash join: If one table is significantly smaller (the default threshold is 10 MB, configurable via `spark.sql.autoBroadcastJoinThreshold`; raising it to tens or hundreds of MB is common), Spark can broadcast it to all executors, avoiding a shuffle of the larger table. Hint explicitly with `/*+ BROADCAST(...) */`:
  `SELECT /*+ BROADCAST(b) */ a.*, b.dim_name FROM fact_table a JOIN dimension_table b ON a.dim_key = b.dim_key;`
- Sort-merge join: The default for large joins. Requires both sides to be shuffled and sorted, then merged.
- Skewed joins: If a join key has a highly uneven distribution (data skew), a few "hot" partitions can make a handful of tasks run far longer than the rest.
  - Salting: Append a random suffix to the skewed key on the large side, replicate the other side once per suffix, and join on the composite key. This spreads the hot key across more partitions.
  - Adaptive Query Execution (AQE): Spark 3.0+ can automatically detect and split skewed join partitions. Ensure `spark.sql.adaptive.skewJoin.enabled` is `true`.
4. Leverage Data Layout (Delta Lake specific):
- Partitioning: Organize data by common filter columns (e.g., date, country). Reduces scan time by pruning irrelevant partitions.
  `CREATE TABLE sales_by_date (...) USING DELTA PARTITIONED BY (sale_date);`
- Z-Ordering: For Delta tables, use `ZORDER BY` on frequently queried high-cardinality columns (e.g., `user_id`, `product_id`) to co-locate related data and improve data skipping.
  `OPTIMIZE my_delta_table ZORDER BY (user_id, product_id);`
- Liquid Clustering: A more flexible alternative to partitioning and Z-Ordering in Delta Lake, automatically adapting to data changes and query patterns.
- Compaction (`OPTIMIZE`): Regularly run `OPTIMIZE` to combine small files into larger ones, reducing metadata overhead and improving read performance.
5. Aggregations:
- Pre-aggregation: If certain aggregations are frequently used, consider pre-calculating and storing them in a separate Gold layer table.
- Minimize distinct counts: `COUNT(DISTINCT column)` is expensive because it requires a global shuffle. When an exact count isn't needed, use approximations such as `approx_count_distinct` (HyperLogLog-based).
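To make the HyperLogLog idea concrete, here is a toy sketch of the technique (not Spark's implementation): hash each value, use the first bits to pick a register, and keep the maximum run of leading zeros seen per register. Duplicates never change the sketch, which is why it approximates `COUNT(DISTINCT)` without a shuffle-wide exact set.

```python
import hashlib

def hll_estimate(values, b=8):
    """Toy HyperLogLog: 2**b registers, each holding the max count of
    leading zero bits (plus one) observed in its hash bucket."""
    m = 1 << b
    registers = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                 # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)       # remaining 64-b bits
        rank = (64 - b) - rest.bit_length() + 1  # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

true_count = 10_000
# Duplicates don't change the estimate, just like COUNT(DISTINCT):
est = hll_estimate(list(range(true_count)) * 3)
error = abs(est - true_count) / true_count
```

With 256 registers the typical relative error is around 1.04/sqrt(256) ≈ 6.5%, using a few hundred bytes instead of a 10,000-element set. Production implementations add small- and large-range corrections omitted here.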
6. Configuration Tuning (Databricks specific):
- Adaptive Query Execution (AQE): Ensure `spark.sql.adaptive.enabled=true`. It dynamically optimizes query plans at runtime.
- Photon Engine: Use Photon-enabled clusters for significant performance boosts on SQL and DataFrame operations.
- Shuffle partitions: Adjust `spark.sql.shuffle.partitions` based on cluster size and data volume. Too few partitions lead to oversized tasks; too many add scheduling overhead.
Other Useful Concepts & Best Practices
Platform-Specific Tuning Considerations:
- Databricks/Spark SQL: Heavily relies on partitioning, Z-Ordering/Liquid Clustering, and broadcast joins. AQE is a key feature to leverage.
- BigQuery: Automatically handles many optimizations. The key is to leverage clustering and partitioning to reduce the amount of data scanned.
- Redshift: Requires more manual tuning of distribution styles and sort keys. Understanding the query plan is critical.
1. Common Table Expressions (CTEs):
- Improve readability and modularity of complex queries. Can sometimes aid optimization by allowing the optimizer to better understand the query flow.
```sql
WITH daily_sales AS (
    SELECT
        sale_date,
        SUM(amount) AS total_daily_sales
    FROM
        sales
    GROUP BY
        sale_date
),
monthly_avg AS (
    SELECT
        DATE_TRUNC('month', sale_date) AS month,
        AVG(total_daily_sales) AS avg_monthly_sales
    FROM
        daily_sales
    GROUP BY
        DATE_TRUNC('month', sale_date)
)
SELECT * FROM monthly_avg;
```
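The chained-CTE pattern runs unchanged on most engines. A runnable version using SQLite (with `strftime('%Y-%m', ...)` substituting for `DATE_TRUNC`, which SQLite lacks) and made-up `sales` rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2023-01-01", 10), ("2023-01-01", 20),   # Jan 1 total: 30
    ("2023-01-02", 50),                       # Jan 2 total: 50
    ("2023-02-01", 70),                       # Feb 1 total: 70
])

rows = conn.execute("""
    WITH daily_sales AS (
        SELECT sale_date, SUM(amount) AS total_daily_sales
        FROM sales GROUP BY sale_date
    ),
    monthly_avg AS (
        SELECT strftime('%Y-%m', sale_date) AS month,
               AVG(total_daily_sales) AS avg_monthly_sales
        FROM daily_sales GROUP BY month
    )
    SELECT * FROM monthly_avg ORDER BY month
""").fetchall()
# January averages (30 + 50) / 2 = 40; February has a single day at 70:
# [('2023-01', 40.0), ('2023-02', 70.0)]
```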
2. Subqueries:
- Can be used in `SELECT`, `FROM`, or `WHERE` clauses. Correlated subqueries (where the inner query references the outer query) can be very inefficient on large datasets and should usually be rewritten as joins or window functions.
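A classic rewrite is "latest order per customer": the correlated form conceptually re-runs the inner query per outer row, while the window form makes a single pass. Both queries below return identical results (the `orders` table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2023-01-01", 10), (1, "2023-03-01", 30),
    (2, "2023-02-01", 20),
])

# Correlated form: inner query depends on the outer row.
correlated = conn.execute("""
    SELECT customer_id, order_date, total FROM orders o
    WHERE order_date = (SELECT MAX(order_date) FROM orders i
                        WHERE i.customer_id = o.customer_id)
    ORDER BY customer_id
""").fetchall()

# Window rewrite: one pass, one sort per partition -- far friendlier
# to distributed engines.
windowed = conn.execute("""
    SELECT customer_id, order_date, total FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date DESC) AS rn
        FROM orders)
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
```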
3. Delta Lake SQL Commands:
- `MERGE INTO`: For UPSERT operations (insert if not exists, update if exists). Highly efficient for CDC (Change Data Capture) and managing slowly changing dimensions.
  `MERGE INTO target_table AS t USING source_table AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET t.value = s.value WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);`
- `VACUUM`: Removes data files no longer referenced by a Delta table and older than the retention threshold. Important for cost management and GDPR compliance.
  `VACUUM my_delta_table RETAIN 168 HOURS;`
- `RESTORE`: Reverts a Delta table to an earlier version using time travel.
  `RESTORE TABLE my_delta_table TO VERSION AS OF 5;` or `RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2023-01-01 10:00:00';`
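`MERGE INTO` is Delta-specific, but the matched-update / not-matched-insert semantics can be exercised locally with SQLite's UPSERT clause (`INSERT ... ON CONFLICT DO UPDATE`, SQLite 3.24+), which covers the same two branches. Table names mirror the `MERGE` example above; the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO target_table VALUES (?, ?)",
                 [(1, "old"), (2, "keep")])

# Source rows: id 1 already exists (update), id 3 is new (insert).
source = [(1, "new"), (3, "fresh")]
conn.executemany("""
    INSERT INTO target_table (id, value) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET value = excluded.value
""", source)

rows = sorted(conn.execute("SELECT id, value FROM target_table").fetchall())
# [(1, 'new'), (2, 'keep'), (3, 'fresh')]
```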
4. Data Type Optimization:
- Use the most compact data types possible (e.g., `INT` instead of `BIGINT` if values fit, `DATE` instead of `TIMESTAMP` if time of day isn't needed). This reduces storage and memory footprint.