Working With Large Numbers of Files

High-performance computing clusters typically use parallel filesystems optimized for large data throughput. These filesystems provide very high bandwidth for reading and writing large datasets, but they can perform poorly when workloads generate very large numbers of small files.

Operations such as opening, listing, or deleting thousands of files place heavy load on the filesystem metadata servers. As a result, jobs that create many small files may experience significantly slower performance and can negatively affect the system for other users.

System overload

Workflows that generate millions of small files can overload filesystem metadata servers and severely degrade performance for both your jobs and other users on the system.

Why Many Small Files Are Problematic

Each file stored on a filesystem requires metadata operations such as:

  • file creation
  • file lookup
  • permission checks
  • directory updates

When millions of small files are created, these metadata operations become the dominant cost of the workflow rather than the actual data transfer.
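This effect can be seen with a minimal sketch: the same amount of data stored as many small files requires many times more metadata operations than a single file of equal size. All paths below are throwaway demo paths, not real cluster locations.

```shell
# Minimal sketch: ~1 MB of data as 1000 small files vs. one file.
# Runs entirely inside a throwaway temporary directory.
demo=$(mktemp -d)
cd "$demo"

mkdir many
for i in $(seq 1 1000); do
    head -c 1024 /dev/zero > "many/part_$i.dat"   # 1000 creates + 1000 writes
done

head -c 1048576 /dev/zero > single.dat            # 1 create + 1 write

# Both variants hold ~1 MB, but the first cost 1000x the metadata operations.
ls many | wc -l    # counts the small files just created
```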

Typical symptoms include:

  • slow ls or du commands
  • slow job startup times
  • long file deletion times
  • reduced I/O performance

Common Situations That Produce Many Files

Workflows that often generate large numbers of files include:

  • parameter sweeps producing one output file per job
  • molecular dynamics trajectories written as many small snapshots
  • logging systems producing one file per process
  • temporary intermediate files created by data processing pipelines

Aggregate Files When Possible

Instead of producing thousands of individual files, combine outputs into larger containers.
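As a hedged sketch of this idea, per-task text outputs can be concatenated into one file and the parts deleted. The `result_*.csv` filenames are hypothetical; demo inputs are created first so the example is runnable.

```shell
# Sketch: merge many per-task result files into one CSV.
# The result_*.csv naming is an assumption; demo files stand in.
work=$(mktemp -d)
cd "$work"
for i in 1 2 3; do
    echo "run_$i,0.$i" > "result_$i.csv"
done

# Concatenate all per-task files into a single output, then remove the parts.
cat result_*.csv > all_results.csv
rm result_*.csv

wc -l < all_results.csv   # one line per original file
```

The same pattern works for any line-oriented output; binary outputs are better served by archives or container formats.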

For example, compress a directory into a single archive:

tar czf results_archive.tar.gz results_directory/

Aggregating files reduces metadata operations and improves filesystem performance.

Tip

Archiving directories into a single file can significantly reduce filesystem metadata load and improve transfer performance.

Use Application-Level Containers

Many scientific applications support storing multiple outputs in a single file format, such as:

  • HDF5
  • NetCDF
  • SQLite databases

Containers

Container formats allow many datasets to be stored within a single file, reducing filesystem overhead and improving I/O efficiency.

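As one hedged illustration, a SQLite database can hold many small records in a single file. This sketch assumes the `sqlite3` command-line tool is installed; the table and column names are made up for the example.

```shell
# Sketch: store many small measurement records in one SQLite file instead of
# one file per record. Assumes the sqlite3 CLI is available.
db=$(mktemp -d)/results.db

sqlite3 "$db" "CREATE TABLE results (run TEXT, value REAL);"
for i in 1 2 3; do
    sqlite3 "$db" "INSERT INTO results VALUES ('run_$i', 0.$i);"
done

sqlite3 "$db" "SELECT COUNT(*) FROM results;"   # all records live in one file
```

HDF5 and NetCDF offer the same single-file property for array-structured scientific data, usually through their language bindings rather than a shell tool.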

Stage Data on Scratch During Jobs

Workflows generating many temporary files should run on scratch storage whenever possible. Scratch filesystems are optimized for high I/O workloads and temporary data.

/scratch usage

Use /scratch for intermediate files produced during job execution. Move only the final results to persistent storage.

After the job completes, important results should be archived or moved to persistent storage.
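The staging pattern can be sketched as a few lines of a job script. The `$SCRATCH` variable and the result filenames are assumptions; temporary directories stand in here for scratch and persistent storage so the sketch is runnable.

```shell
# Sketch of staging work on scratch. $SCRATCH and all filenames are
# assumptions; mktemp directories stand in for real storage locations.
SCRATCH=${SCRATCH:-$(mktemp -d)}
workdir="$SCRATCH/job_demo"
mkdir -p "$workdir"
cd "$workdir"

# ... the job generates many temporary intermediate files here ...
touch tmp_1.dat tmp_2.dat
echo "final answer" > result.txt

# Keep only the final result; clean the temporaries off scratch.
dest=$(mktemp -d)            # stands in for persistent project storage
cp result.txt "$dest/"
rm -f tmp_*.dat
```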

Avoid Extremely Large Directories

Directories containing very large numbers of entries may become slow to access.

Instead of storing all files in one directory, organize files into a hierarchical structure. For example:

data/
  run_01/
  run_02/
  run_03/

Note

Splitting files across multiple directories reduces lookup overhead and improves filesystem responsiveness.
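A flat directory can be reorganized into such a hierarchy with a short loop. The `out_runXX_*.dat` naming convention below is hypothetical; demo files are created first so the sketch is runnable.

```shell
# Sketch: split a flat directory of outputs into per-run subdirectories.
# The out_runXX_*.dat filename convention is an assumption.
flat=$(mktemp -d)
cd "$flat"
touch out_run01_a.dat out_run01_b.dat out_run02_a.dat

for f in out_run*_*.dat; do
    run=${f#out_}          # strip the "out_" prefix
    run=${run%%_*}         # keep only the runXX part
    mkdir -p "$run"
    mv "$f" "$run/"
done

ls                          # now only the run directories remain
```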

Remove Temporary Files

Temporary files should be removed once they are no longer needed. This prevents accumulation of unused data and helps maintain good filesystem performance.

File cleanup

Cleaning temporary files regularly helps keep scratch storage available for all users.
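Cleanup is typically a single find command. The `tmp_` naming convention is an assumption, and a throwaway directory stands in for scratch so the sketch is safe to run.

```shell
# Sketch: remove leftover temporary files by name pattern.
# The tmp_ prefix is an assumption; a temp dir stands in for scratch.
scratch=$(mktemp -d)
touch "$scratch/tmp_a.dat" "$scratch/tmp_b.dat" "$scratch/keep.txt"

find "$scratch" -name 'tmp_*' -type f -delete

# On a real system, add an age filter such as -mtime +7 so only files
# older than seven days are removed.
```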

Diagnosing File Count Problems

To determine how many files exist in a directory, you can run:

find <directory> -type f | wc -l

To inspect disk usage including directories:

du -sh *

Tools such as dust can also help visualize storage usage across large directory trees.
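To locate which subdirectories hold the most files, the per-directory counts can be combined with a sort. This is a sketch over demo data; on a real system you would point it at the directory under investigation.

```shell
# Sketch: count files in each immediate subdirectory to find the hotspots.
# Demo data is created first so the example is runnable.
top=$(mktemp -d)
mkdir -p "$top/run_01" "$top/run_02"
touch "$top/run_01/a" "$top/run_01/b" "$top/run_02/c"

for d in "$top"/*/; do
    printf '%s %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -rn          # largest file counts first
```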

Archiving Large File Collections

When transferring or storing directories containing many files, it is often more efficient to archive them into a single file before moving them between systems.

Example:

tar czf dataset.tar.gz dataset_directory/

This reduces transfer overhead and simplifies data management.
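Before deleting the originals, it is prudent to confirm the archive holds the expected number of entries. This sketch creates demo data first so it is runnable; the `dataset_directory` name matches the example above.

```shell
# Sketch: archive a directory, then compare file counts between the archive
# and the source before removing the originals. Demo data stands in.
src=$(mktemp -d)/dataset_directory
mkdir -p "$src"
touch "$src/a.dat" "$src/b.dat"

cd "$(dirname "$src")"
tar czf dataset.tar.gz dataset_directory/

# List archive members without extracting; exclude directory entries
# (which end in /) so the count is comparable to find -type f.
tar tzf dataset.tar.gz | grep -cv '/$'   # file entries in the archive
find dataset_directory -type f | wc -l   # files in the source directory
```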

Created by: Andrej Sec