How We Reduced Data Storage by 80%

Note: The project discussed in this article required the use of proprietary code. If you or someone you know is implementing a similar solution, generalized examples and assistance can be provided upon request.

Last year, I asked my team to delete 80% of our company's data. The thought spiked my blood pressure, but it was the right decision and it was overdue. Terabytes of data disappeared in minutes and it was one of the best decisions I've made in my career.

Data teams love to talk about scale, but far fewer talk about cleanup.

Left unmanaged, modern data platforms quietly accumulate massive volumes of unused files. Storage costs creep up, query performance degrades, and governance becomes more and more difficult. This is especially true in lakehouse architectures built on Parquet and Delta Lake, where historical versions of data are preserved by design.

We addressed this problem head-on by deleting unused Parquet files through a targeted Delta vacuuming strategy. The result was a leaner, more efficient storage footprint, without sacrificing critical capabilities like time travel.

Here’s how it works, and how you can apply the same approach.

Understanding Parquet Files

Parquet is a columnar storage format widely used in modern data platforms. Unlike row-based formats (like CSV), Parquet stores data by column, which enables faster query performance, better compression, and efficient analytical workloads.

However, Parquet’s strengths come with a tradeoff: Parquet files are immutable once written. Every time data is updated in a Delta table, new Parquet files are written rather than modifying existing ones. Over time, this creates redundant files, outdated versions of data, and fragmentation across storage.

This isn’t a flaw; it’s what enables powerful features like versioning and rollback. But without active management, it leads to uncontrolled growth. That growth compounds with every update and can quietly get out of hand, hamstringing engineering teams.
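To make the accumulation concrete, here is a minimal sketch of a copy-on-write table. It is a toy model, not Delta Lake itself: the `ToyDeltaTable` class, the file names, and the 30-refresh scenario are all hypothetical and exist only to show how unreferenced files pile up when nothing is ever deleted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy copy-on-write table: files are never edited in place."""

    live_files: set = field(default_factory=set)  # referenced by the current version
    all_files: set = field(default_factory=set)   # everything still sitting in storage
    version: int = -1

    def write(self, n_files: int) -> None:
        """Rewrite the whole table as n_files new files (a full refresh)."""
        self.version += 1
        new_files = {f"part-{self.version:05d}-{i}.parquet" for i in range(n_files)}
        self.all_files |= new_files  # new files are added to storage
        self.live_files = new_files  # old files are dereferenced, not deleted


table = ToyDeltaTable()
for _ in range(30):  # hypothetical: one full refresh per day for 30 days
    table.write(n_files=4)

live = len(table.live_files)
stored = len(table.all_files)
print(f"live files: {live}, files in storage: {stored}")
print(f"share of storage that is unreferenced: {1 - live / stored:.0%}")
```

In this model the current table never needs more than four files, yet storage keeps every file ever written. Real tables are messier (appends, partial rewrites, compaction), but the direction of travel is the same.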

Understanding Delta Lake

Delta Lake builds on top of Parquet to provide ACID transactions, schema enforcement, and, critically, time travel. Time travel allows you to query historical versions of a table, which is invaluable for debugging pipelines, recovering from bad writes, or auditing changes.

That said, time travel depends on retaining old Parquet files. If you never clean them up, storage balloons. If you delete too aggressively, you lose the ability to recover. Optimization is about finding the right balance.

The Role of Vacuuming in Delta Optimization

Vacuuming, in the context of Delta tables, is simply the process of safely removing Parquet files that are no longer needed. At a high level, a vacuum operation identifies files not referenced by the current table state, removes those older than a defined retention period, and preserves files required for time travel within that retention window.
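The selection rule described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Delta Lake's implementation: the function name, its arguments, and the idea of passing in a ready-made `removed_at` mapping are assumptions made for the example. A real vacuum reads this information from the table's transaction log and file metadata.

```python
from datetime import datetime, timedelta, timezone


def select_files_to_vacuum(files_in_storage, live_files, removed_at, now, retention):
    """Return the files a vacuum-style cleanup would delete.

    files_in_storage: iterable of file names found in the table directory
    live_files: set of file names referenced by the current table version
    removed_at: dict mapping file name -> datetime it was dereferenced
    """
    cutoff = now - retention
    deletable = []
    for name in files_in_storage:
        if name in live_files:
            continue  # still part of the current table state
        dereferenced = removed_at.get(name)
        if dereferenced is not None and dereferenced < cutoff:
            deletable.append(name)  # unreferenced and outside the retention window
    return sorted(deletable)


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
removed_at = {
    "a.parquet": now - timedelta(days=60),  # dereferenced long ago
    "b.parquet": now - timedelta(days=10),  # still needed for time travel
}
to_delete = select_files_to_vacuum(
    files_in_storage=["a.parquet", "b.parquet", "c.parquet"],
    live_files={"c.parquet"},
    removed_at=removed_at,
    now=now,
    retention=timedelta(days=45),
)
print(to_delete)
```

Only the file that is both unreferenced and older than the window is selected. The live file and the recently replaced file survive, which is what keeps time travel working inside the retention period.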

By default, most platforms enforce a minimum retention of 7 days to protect against accidental data loss. Those limits can technically be bypassed, but production environments call for a more deliberate policy.
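One way to keep that safeguard front of mind is to build vacuum statements through a small helper that refuses anything below the floor. The sketch below assumes Delta Lake's SQL form `VACUUM <table> RETAIN <n> HOURS`, with the optional `DRY RUN` suffix. The helper name, its parameters, and the table name are hypothetical, and the 168-hour floor simply mirrors the 7-day default mentioned above.

```python
DEFAULT_MIN_RETENTION_HOURS = 7 * 24  # the 7-day default floor, in hours


def build_vacuum_statement(table_name, retain_days, dry_run=False,
                           min_hours=DEFAULT_MIN_RETENTION_HOURS):
    """Build a VACUUM statement, refusing windows below the safety floor."""
    retain_hours = retain_days * 24
    if retain_hours < min_hours:
        raise ValueError(
            f"retention of {retain_hours}h is below the {min_hours}h safety floor"
        )
    statement = f"VACUUM {table_name} RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"  # lists candidate files without deleting them
    return statement


# Hypothetical table name; a 45-day window works out to 1080 hours.
print(build_vacuum_statement("sales_orders", retain_days=45, dry_run=True))
```

Starting with a dry run is a cheap way to see what would be removed before committing to a permanent delete.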

Our Approach: 45-Day Retention with Automated Cleanup

After evaluating our operational needs, my team and I decided on a 45-day retention window. That gave us enough historical depth for incident recovery, as well as a clear boundary to control storage growth.

From there, we implemented a simple but effective system that still runs like clockwork today.

1) Scheduled notebooks run weekly.

2) Each notebook run executes a loop of targeted vacuuming operations with a 45-day retention threshold.

3) Old, unreferenced Parquet files are permanently deleted to save space and maximize query performance.
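The loop in step 2 can be sketched roughly as follows. This is an illustrative outline rather than the production notebook: `run_vacuum_job`, `fake_vacuum`, and the table names are hypothetical, and the actual vacuum call is passed in as a function so the sketch runs anywhere. In a Fabric notebook, that function would wrap whatever vacuum call your platform provides.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("vacuum_job")

RETAIN_HOURS = 45 * 24  # 45-day retention expressed in hours


def run_vacuum_job(tables, vacuum_fn, retain_hours=RETAIN_HOURS):
    """Vacuum each table in turn and keep going if one table fails."""
    results = {}
    for table in tables:
        started = time.monotonic()
        try:
            vacuum_fn(table, retain_hours)
            results[table] = "ok"
        except Exception as exc:  # one bad table should not stop the run
            results[table] = f"failed: {exc}"
        log.info("%s -> %s (%.1fs)", table, results[table], time.monotonic() - started)
    return results


def fake_vacuum(table, retain_hours):
    """Stand-in for the real vacuum call (hypothetical)."""
    if table == "broken_table":
        raise RuntimeError("table not found")


summary = run_vacuum_job(["orders", "customers", "broken_table"], fake_vacuum)
```

Catching errors per table and logging the outcome means a single failure does not silently skip the rest of the run, and the returned summary gives you something to alert on.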

The impact was immediate. By removing stale data files, we reduced our storage footprint by more than 80%, quite literally overnight. Storage costs were slashed, queries were lightning-fast once again, and governance and lifecycle management had never been easier.

Automating the Process with Python and Semantic Link Labs

While vacuuming operations can be run manually, automation is where this strategy becomes sustainable.

We implemented our cleanup logic in Python notebooks, which gave us flexibility to automatically target specific tables, log and monitor cleanup activity, integrate with Fabric's native pipelines and scheduling systems, and assess capacity usage for the operations.

One important caveat: vacuuming a large number of Parquet files at once can be capacity-intensive. Our Microsoft Fabric tenant uses multiple F64 capacities, and they were quickly throttled when we first tried to vacuum terabytes of Parquet files in one pass. We now vacuum half of our tables on the 15th of the month and the other half on the 30th to spread out the capacity burden.
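A split like this can be driven by a small scheduling helper. The sketch below is one possible way to do it, not a description of the actual notebooks: the function names and table names are hypothetical, and the round-robin split by table name is an assumption. Balancing the batches by table size would likely spread the load more evenly.

```python
from datetime import date


def split_into_batches(tables, n_batches=2):
    """Deal sorted table names into n_batches groups, round-robin."""
    ordered = sorted(tables)
    return [ordered[i::n_batches] for i in range(n_batches)]


def tables_for_today(tables, today, schedule=(15, 30)):
    """Return the batch assigned to today's day of month, or [] on other days."""
    batches = split_into_batches(tables, n_batches=len(schedule))
    for day, batch in zip(schedule, batches):
        if today.day == day:
            return batch
    return []


all_tables = ["orders", "customers", "inventory", "shipments"]  # hypothetical names
first_half = tables_for_today(all_tables, date(2024, 6, 15))
second_half = tables_for_today(all_tables, date(2024, 6, 30))
print(first_half)
print(second_half)
```

Sorting before splitting keeps each table in the same batch from month to month as long as the table list is stable. Note that February has no 30th, so a real schedule needs a fallback such as the last day of the month.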

Other teams working in Python notebooks within Microsoft Fabric, as mine does, will likely benefit from Semantic Link Labs.

Semantic Link Labs provides utilities that simplify working with Delta tables programmatically, including lifecycle operations like vacuuming. Instead of stitching together low-level commands, you can script repeatable cleanup workflows, apply consistent retention policies across datasets, and integrate storage optimization into your broader data engineering pipelines.

This is especially valuable if your team prefers Python over SQL or is already managing infrastructure through Fabric notebooks.

My team used the following code to begin development on these notebooks:

# Import the necessary library
import sempy_labs.lakehouse as lake

# Enter the name or ID of the lakehouse
lakehouse = None

# Enter the name or ID of the workspace in which the lakehouse exists
workspace = None

# The number of hours to retain historical versions of Delta table files
# (for example, a 45-day window is 45 * 24 = 1080 hours)
retain_n_hours = None

# Example 1: Vacuum a single table
lake.vacuum_lakehouse_tables(tables='MyTable', lakehouse=lakehouse, workspace=workspace, retain_n_hours=retain_n_hours)

# Example 2: Vacuum several tables
lake.vacuum_lakehouse_tables(tables=['MyTable', 'MySecondTable'], lakehouse=lakehouse, workspace=workspace, retain_n_hours=retain_n_hours)

# Example 3: Vacuum all delta tables within the lakehouse
lake.vacuum_lakehouse_tables(tables=None, lakehouse=lakehouse, workspace=workspace, retain_n_hours=retain_n_hours)

Key Considerations Before You Vacuum

Before implementing a vacuuming strategy, there are a few things you need to get right.

1. Define Your Retention Window Carefully
Shorter retention saves storage but limits recovery options. Align this with your incident response needs.

2. Understand Downstream Dependencies
Ensure no processes rely on older snapshots beyond your retention period.

3. Monitor Before and After
Track storage footprint, capacity usage, and query performance before and after each run to quantify the impact.

4. Be Aware of Safety Checks
Delta allows bypassing retention safeguards, but doing so without full awareness can lead to irreversible data loss.
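For the third consideration, quantifying the impact comes down to simple arithmetic on before-and-after measurements. The sketch below shows the calculation; the helper name is hypothetical and the byte counts are illustrative placeholders, not measurements from the project described here.

```python
def storage_reduction(bytes_before, bytes_after):
    """Fraction of storage freed, e.g. 0.8 for an 80% reduction."""
    if bytes_before <= 0:
        raise ValueError("bytes_before must be positive")
    return (bytes_before - bytes_after) / bytes_before


TB = 1024 ** 4
before, after = 10 * TB, 2 * TB  # illustrative numbers only
print(f"reduction: {storage_reduction(before, after):.0%}")
```

Recording the same figures for capacity usage and query duration alongside storage makes it easier to tell whether a retention change actually paid off.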

Final Thoughts

Data systems don’t just need to scale; they need to be maintained. Parquet and Delta Lake give you powerful tools for building reliable, auditable pipelines. But without intentional cleanup, those same systems will quietly accumulate waste.

Vacuuming isn’t glamorous work. It doesn’t ship features or unlock new capabilities. But it’s one of the highest-leverage optimizations you can make.

If you’re sitting on months (or years) of unused Parquet files, you’re not just paying for storage; you’re carrying unnecessary complexity. The good news is that fixing it is simpler than you may think, and the payoff is immediate.

Happy cleaning!
