Understanding Delta Table Partition Size Distribution Using the Delta Log
Databricks/Delta Partitioning Strategies
When working with externally managed Delta tables and traditional partitioning strategies (for example by day, week, or month), one common challenge is:
How large are my partitions actually?
Before deciding whether to partition by day vs. week vs. month, it’s important to understand how data is physically distributed across partitions.
This article shows how to extract partition-level size statistics directly from the Delta transaction log.
⚠️ This approach is useful when using traditional partitioning and external table strategies.
If you’re using Unity Catalog managed tables or Liquid Clustering, partition sizing decisions are handled differently.
Why Look at the Delta Log?
Every Delta table maintains a _delta_log directory containing JSON transaction files.
Each add action inside the log includes:
File path
File size (in bytes)
Partition values
By reading these logs, we can compute:
Number of files per partition
Total bytes per partition
Size distribution across partitions
This gives direct visibility into the physical layout of your table.
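For illustration, each line in a `_delta_log` JSON file is one self-describing action. A simplified `add` action (field names follow the Delta transaction log protocol; the path, size, and partition values here are hypothetical examples) can be parsed with nothing more than the standard library:

```python
import json

# A simplified "add" action as it might appear in a _delta_log JSON file.
# The path, size, and partition value are hypothetical examples.
log_line = '''{"add": {"path": "startDate=2026-02-16/part-00000.snappy.parquet",
                       "partitionValues": {"startDate": "2026-02-16"},
                       "size": 10485760,
                       "dataChange": true}}'''

action = json.loads(log_line)
add = action["add"]
print(add["path"])                          # file path relative to the table root
print(add["size"])                          # file size in bytes
print(add["partitionValues"]["startDate"])  # the partition value we aggregate on
```

These are exactly the three fields the PySpark example below pulls out of each `add` action.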
Example: Compute Partition Sizes by startDate
from pyspark.sql import functions as F
delta_log = spark.read.json(
"abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)
files = (
delta_log
.filter("add is not null")
.select(
F.col("add.path").alias("path"),
F.col("add.size").alias("size"),
F.col("add.partitionValues.startDate").alias("startDate")
)
)
(
files.groupBy("startDate")
.agg(
F.count("*").alias("numFiles"),
F.sum("size").alias("totalBytes")
)
.withColumn("sizeGB", F.col("totalBytes") / (1024**3))
.orderBy("startDate", ascending=False)
.show(20, False)
)
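One caveat before reading the output: the log also records remove actions, so files that were later deleted or rewritten (by OPTIMIZE, MERGE, and so on) still appear as `add` entries in older commits and can inflate the totals. The fix is to exclude any path that a later `remove` action covers. The idea can be sketched in plain Python (the add/remove entries below are hypothetical stand-ins for parsed log lines):

```python
# Sketch: exclude files that a later "remove" action marked as deleted.
# The entries below are hypothetical stand-ins for parsed log actions.
adds = [
    {"path": "startDate=2026-02-15/part-00000.parquet", "size": 5 * 1024**2},
    {"path": "startDate=2026-02-15/part-00001.parquet", "size": 7 * 1024**2},
    {"path": "startDate=2026-02-16/part-00000.parquet", "size": 9 * 1024**2},
]
removed_paths = {"startDate=2026-02-15/part-00001.parquet"}  # from "remove" actions

# Keep only files that are still live (an anti-join on path)
live_files = [a for a in adds if a["path"] not in removed_paths]
total_live_bytes = sum(a["size"] for a in live_files)
print(len(live_files), total_live_bytes)
```

In the PySpark version above, the equivalent is a left-anti join of `files` against `delta_log.filter("remove is not null").select(F.col("remove.path").alias("path"))`.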
Sample Output
+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB |
+----------+--------+-----------+--------+
|2026-02-16| xxx | xxx | xx |
|2026-02-15| xxx | xxx | xx |
+----------+--------+-----------+--------+
How This Helps With Partition Strategy
Once you have partition size metrics, you can evaluate:
Are partitions too small?
If daily partitions are only a few MB:
You may be over-partitioning.
Consider partitioning by week or month instead.
Are partitions too large?
If partitions exceed hundreds of GB:
Queries may scan too much data.
Consider a finer-grained partitioning strategy.
Are files unevenly distributed?
A high numFiles count combined with a small average file size indicates a small-file problem.
Decision Guidelines
Partition Size (per partition) | Recommendation
< 1 GB                         | Likely over-partitioned
1–20 GB                        | Usually healthy
50+ GB                         | Consider finer partitioning
100+ GB                        | May impact performance
(Adjust based on workload and query patterns.)
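As a rough sketch, the guideline table can be encoded as a helper. The thresholds are the illustrative ones above, not hard rules, and the table leaves the 20–50 GB band unspecified; this sketch treats everything between 20 GB and 100 GB as "consider finer partitioning":

```python
def partition_size_advice(size_gb: float) -> str:
    """Map a partition's size in GB to the rough guidance from the table.

    Thresholds are illustrative; tune them for your workload.
    """
    if size_gb < 1:
        return "Likely over-partitioned"
    if size_gb <= 20:
        return "Usually healthy"
    if size_gb < 100:
        return "Consider finer partitioning"
    return "May impact performance"

print(partition_size_advice(0.3))  # Likely over-partitioned
print(partition_size_advice(8))    # Usually healthy
print(partition_size_advice(60))   # Consider finer partitioning
print(partition_size_advice(250))  # May impact performance
```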
Bonus: Average File Size Per Partition
You can extend the analysis:
.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))
This helps detect small file problems inside partitions.
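Building on that, a quick way to flag small-file partitions from the aggregated metrics, sketched in plain Python (the rows are hypothetical, and the 128 MB threshold is a common rule of thumb, not a hard rule):

```python
# Hypothetical aggregated rows, as they might come back from the groupBy above
partitions = [
    {"startDate": "2026-02-16", "numFiles": 400, "totalBytes": 2 * 1024**3},
    {"startDate": "2026-02-15", "numFiles": 8,   "totalBytes": 4 * 1024**3},
]

SMALL_FILE_MB = 128  # rule-of-thumb threshold for "small" average file size

for p in partitions:
    avg_mb = p["totalBytes"] / p["numFiles"] / 1024**2
    if avg_mb < SMALL_FILE_MB:
        print(f"{p['startDate']}: avg {avg_mb:.1f} MB/file -> consider compaction (OPTIMIZE)")
```

Here the first partition averages about 5 MB per file and gets flagged; the second averages 512 MB per file and is fine.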
When Should You Use This?
This approach is particularly useful when:
Using external Delta tables
Managing your own storage layout
Designing a new partition strategy
Migrating legacy Hive-style tables
Troubleshooting performance issues
It gives a low-level, transparent view of how data is physically stored.
Final Thoughts
Partitioning decisions should be based on:
Query access patterns
Partition cardinality
Physical partition size
File size distribution
Reading the Delta transaction log provides a simple yet powerful way to understand your table layout — before committing to a partitioning strategy.