Understanding Delta Table Partition Size Distribution Using the Delta Log
Databricks/Delta Partitioning Strategies
When working with externally managed Delta tables and traditional partitioning strategies (for example by day, week, or month), one common challenge is:
How large are my partitions actually?
Before deciding whether to partition by day vs. week vs. month, it’s important to understand how data is physically distributed across partitions.
This article shows how to extract partition-level size statistics directly from the Delta transaction log.
⚠️ This approach is useful when using traditional partitioning and external table strategies.
If you’re using Unity Catalog managed tables or Liquid Clustering, partition sizing decisions are handled differently.
Why Look at the Delta Log?
Every Delta table maintains a _delta_log directory containing JSON transaction files.
Each add action inside the log includes:
File path
File size (in bytes)
Partition values
By reading these logs, we can compute:
Number of files per partition
Total bytes per partition
Size distribution across partitions
This gives direct visibility into the physical layout of your table.
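For illustration, each line in a `_delta_log` JSON file is one self-describing action. A simplified `add` action (field names follow the Delta transaction log protocol; the path, size, and partition values here are hypothetical examples) can be parsed with nothing more than the standard library:

```python
import json

# A simplified "add" action as it might appear in a _delta_log JSON file.
# The path, size, and partition value are hypothetical examples.
log_line = '''{"add": {"path": "startDate=2026-02-16/part-00000.snappy.parquet",
                       "partitionValues": {"startDate": "2026-02-16"},
                       "size": 10485760,
                       "dataChange": true}}'''

action = json.loads(log_line)
add = action["add"]
print(add["path"])                          # file path relative to the table root
print(add["size"])                          # file size in bytes
print(add["partitionValues"]["startDate"])  # the partition value we aggregate on
```

These are exactly the three fields the PySpark example below pulls out of each `add` action.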
Example: Compute Partition Sizes by startDate
from pyspark.sql import functions as F
delta_log = spark.read.json(
"abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)
files = (
delta_log
.filter("add is not null")
.select(
F.col("add.path").alias("path"),
F.col("add.size").alias("size"),
F.col("add.partitionValues.startDate").alias("startDate")
)
)
(
files.groupBy("startDate")
.agg(
F.count("*").alias("numFiles"),
F.sum("size").alias("totalBytes")
)
.withColumn("sizeGB", F.col("totalBytes") / (1024**3))
.orderBy("startDate", ascending=False)
.show(20, False)
)
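One caveat before reading the output: the log also records remove actions, so files that were later deleted or rewritten (by OPTIMIZE, MERGE, and so on) still appear as `add` entries in older commits and can inflate the totals. The fix is to exclude any path that a later `remove` action covers. The idea can be sketched in plain Python (the add/remove entries below are hypothetical stand-ins for parsed log lines):

```python
# Sketch: exclude files that a later "remove" action marked as deleted.
# The entries below are hypothetical stand-ins for parsed log actions.
adds = [
    {"path": "startDate=2026-02-15/part-00000.parquet", "size": 5 * 1024**2},
    {"path": "startDate=2026-02-15/part-00001.parquet", "size": 7 * 1024**2},
    {"path": "startDate=2026-02-16/part-00000.parquet", "size": 9 * 1024**2},
]
removed_paths = {"startDate=2026-02-15/part-00001.parquet"}  # from "remove" actions

# Keep only files that are still live (an anti-join on path)
live_files = [a for a in adds if a["path"] not in removed_paths]
total_live_bytes = sum(a["size"] for a in live_files)
print(len(live_files), total_live_bytes)
```

In the PySpark version above, the equivalent is a left-anti join of `files` against `delta_log.filter("remove is not null").select(F.col("remove.path").alias("path"))`.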
Sample Output
+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB |
+----------+--------+-----------+--------+
|2026-02-16| xxx | xxx | xx |
|2026-02-15| xxx | xxx | xx |
+----------+--------+-----------+--------+
How This Helps With Partition Strategy
Once you have partition size metrics, you can evaluate:
Are partitions too small?
If daily partitions are only a few MB:
You may be over-partitioning.
Consider partitioning by week or month instead.
Are partitions too large?
If partitions exceed hundreds of GB:
Queries may scan too much data.
Consider a finer-grained partitioning strategy.
Are files unevenly distributed?
A high numFiles count combined with a small average file size indicates a small-file problem.
Decision Guidelines
Partition Size (per partition) | Recommendation
< 1 GB                         | Likely over-partitioned
1–20 GB                        | Usually healthy
50+ GB                         | Consider finer partitioning
100+ GB                        | May impact performance
(Adjust based on workload and query patterns.)
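As a rough sketch, the guideline table can be encoded as a helper. The thresholds are the illustrative ones above, not hard rules, and the table leaves the 20–50 GB band unspecified; this sketch treats everything between 20 GB and 100 GB as "consider finer partitioning":

```python
def partition_size_advice(size_gb: float) -> str:
    """Map a partition's size in GB to the rough guidance from the table.

    Thresholds are illustrative; tune them for your workload.
    """
    if size_gb < 1:
        return "Likely over-partitioned"
    if size_gb <= 20:
        return "Usually healthy"
    if size_gb < 100:
        return "Consider finer partitioning"
    return "May impact performance"

print(partition_size_advice(0.3))  # Likely over-partitioned
print(partition_size_advice(8))    # Usually healthy
print(partition_size_advice(60))   # Consider finer partitioning
print(partition_size_advice(250))  # May impact performance
```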
Bonus: Average File Size Per Partition
You can extend the analysis:
.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))
This helps detect small file problems inside partitions.
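Building on that, a quick way to flag small-file partitions from the aggregated metrics, sketched in plain Python (the rows are hypothetical, and the 128 MB threshold is a common rule of thumb, not a hard rule):

```python
# Hypothetical aggregated rows, as they might come back from the groupBy above
partitions = [
    {"startDate": "2026-02-16", "numFiles": 400, "totalBytes": 2 * 1024**3},
    {"startDate": "2026-02-15", "numFiles": 8,   "totalBytes": 4 * 1024**3},
]

SMALL_FILE_MB = 128  # rule-of-thumb threshold for "small" average file size

for p in partitions:
    avg_mb = p["totalBytes"] / p["numFiles"] / 1024**2
    if avg_mb < SMALL_FILE_MB:
        print(f"{p['startDate']}: avg {avg_mb:.1f} MB/file -> consider compaction (OPTIMIZE)")
```

Here the first partition averages about 5 MB per file and gets flagged; the second averages 512 MB per file and is fine.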
When Should You Use This?
This approach is particularly useful when:
Using external Delta tables
Managing your own storage layout
Designing a new partition strategy
Migrating legacy Hive-style tables
Troubleshooting performance issues
It gives a low-level, transparent view of how data is physically stored.
Final Thoughts
Partitioning decisions should be based on:
Query access patterns
Partition cardinality
Physical partition size
File size distribution
Reading the Delta transaction log provides a simple yet powerful way to understand your table layout — before committing to a partitioning strategy.