Working With Large Numbers of Files

High-performance computing clusters typically use parallel filesystems optimized for large data throughput. These filesystems provide very high bandwidth for reading and writing large datasets, but they can perform poorly when workloads generate very large numbers of small files.

Operations such as opening, listing, or deleting thousands of files place heavy load on the filesystem metadata servers. As a result, jobs that create many small files may experience significantly slower performance and can negatively affect the system for other users.

System overload

Workflows that generate millions of small files can overload filesystem metadata servers and severely degrade performance for both your jobs and other users on the system.

Why Many Small Files Are Problematic

Each file stored on a filesystem requires metadata operations such as:

  • file creation
  • file lookup
  • permission checks
  • directory updates

When millions of small files are created, these metadata operations become the dominant cost of the workflow rather than the actual data transfer.
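This effect can be seen with a minimal sketch: the same amount of data stored as many small files requires many times more metadata operations than a single file of equal size. All paths below are throwaway demo paths, not real cluster locations.

```shell
# Minimal sketch: ~1 MB of data as 1000 small files vs. one file.
# Runs entirely inside a throwaway temporary directory.
demo=$(mktemp -d)
cd "$demo"

mkdir many
for i in $(seq 1 1000); do
    head -c 1024 /dev/zero > "many/part_$i.dat"   # 1000 creates + 1000 writes
done

head -c 1048576 /dev/zero > single.dat            # 1 create + 1 write

# Both variants hold ~1 MB, but the first cost 1000x the metadata operations.
ls many | wc -l    # counts the small files just created
```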

Typical symptoms include:

  • slow ls or du commands
  • slow job startup times
  • long file deletion times
  • reduced I/O performance

Common Situations That Produce Many Files

Workflows that often generate large numbers of files include:

  • parameter sweeps producing one output file per job
  • molecular dynamics trajectories written as many small snapshots
  • logging systems producing one file per process
  • temporary intermediate files created by data processing pipelines

Aggregate Files When Possible

Instead of producing thousands of individual files, combine outputs into larger containers.
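As a hedged sketch of this idea, per-task text outputs can be concatenated into one file and the parts deleted. The `result_*.csv` filenames are hypothetical; demo inputs are created first so the example is runnable.

```shell
# Sketch: merge many per-task result files into one CSV.
# The result_*.csv naming is an assumption; demo files stand in.
work=$(mktemp -d)
cd "$work"
for i in 1 2 3; do
    echo "run_$i,0.$i" > "result_$i.csv"
done

# Concatenate all per-task files into a single output, then remove the parts.
cat result_*.csv > all_results.csv
rm result_*.csv

wc -l < all_results.csv   # one line per original file
```

The same pattern works for any line-oriented output; binary outputs are better served by archives or container formats.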

For example, compress a directory into a single archive:

tar czf results_archive.tar.gz results_directory/

Aggregating files reduces metadata operations and improves filesystem performance.

Tip

Archiving directories into a single file can significantly reduce filesystem metadata load and improve transfer performance.

Use Application-Level Containers

Many scientific applications support storing multiple outputs in a single file format, such as:

  • HDF5
  • NetCDF
  • SQLite databases

Containers

Container formats allow many datasets to be stored within a single file, reducing filesystem overhead and improving I/O efficiency.

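As one hedged illustration, a SQLite database can hold many small records in a single file. This sketch assumes the `sqlite3` command-line tool is installed; the table and column names are made up for the example.

```shell
# Sketch: store many small measurement records in one SQLite file instead of
# one file per record. Assumes the sqlite3 CLI is available.
db=$(mktemp -d)/results.db

sqlite3 "$db" "CREATE TABLE results (run TEXT, value REAL);"
for i in 1 2 3; do
    sqlite3 "$db" "INSERT INTO results VALUES ('run_$i', 0.$i);"
done

sqlite3 "$db" "SELECT COUNT(*) FROM results;"   # all records live in one file
```

HDF5 and NetCDF offer the same single-file property for array-structured scientific data, usually through their language bindings rather than a shell tool.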

Stage Data on Scratch During Jobs

Workflows generating many temporary files should run on scratch storage whenever possible. Scratch filesystems are optimized for high I/O workloads and temporary data.

/scratch usage

Use /scratch for intermediate files produced during job execution. Move only the final results to persistent storage.

After the job completes, important results should be archived or moved to persistent storage.
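The staging pattern can be sketched as a few lines of a job script. The `$SCRATCH` variable and the result filenames are assumptions; temporary directories stand in here for scratch and persistent storage so the sketch is runnable.

```shell
# Sketch of staging work on scratch. $SCRATCH and all filenames are
# assumptions; mktemp directories stand in for real storage locations.
SCRATCH=${SCRATCH:-$(mktemp -d)}
workdir="$SCRATCH/job_demo"
mkdir -p "$workdir"
cd "$workdir"

# ... the job generates many temporary intermediate files here ...
touch tmp_1.dat tmp_2.dat
echo "final answer" > result.txt

# Keep only the final result; clean the temporaries off scratch.
dest=$(mktemp -d)            # stands in for persistent project storage
cp result.txt "$dest/"
rm -f tmp_*.dat
```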

Avoid Extremely Large Directories

Directories containing very large numbers of entries may become slow to access.

Instead of storing all files in one directory, organize files into a hierarchical structure. For example:

data/
  run_01/
  run_02/
  run_03/

Note

Splitting files across multiple directories reduces lookup overhead and improves filesystem responsiveness.
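A flat directory can be reorganized into such a hierarchy with a short loop. The `out_runXX_*.dat` naming convention below is hypothetical; demo files are created first so the sketch is runnable.

```shell
# Sketch: split a flat directory of outputs into per-run subdirectories.
# The out_runXX_*.dat filename convention is an assumption.
flat=$(mktemp -d)
cd "$flat"
touch out_run01_a.dat out_run01_b.dat out_run02_a.dat

for f in out_run*_*.dat; do
    run=${f#out_}          # strip the "out_" prefix
    run=${run%%_*}         # keep only the runXX part
    mkdir -p "$run"
    mv "$f" "$run/"
done

ls                          # now only the run directories remain
```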

Remove Temporary Files

Temporary files should be removed once they are no longer needed. This prevents accumulation of unused data and helps maintain good filesystem performance.

File cleanup

Cleaning temporary files regularly helps keep scratch storage available for all users.
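Cleanup is typically a single find command. The `tmp_` naming convention is an assumption, and a throwaway directory stands in for scratch so the sketch is safe to run.

```shell
# Sketch: remove leftover temporary files by name pattern.
# The tmp_ prefix is an assumption; a temp dir stands in for scratch.
scratch=$(mktemp -d)
touch "$scratch/tmp_a.dat" "$scratch/tmp_b.dat" "$scratch/keep.txt"

find "$scratch" -name 'tmp_*' -type f -delete

# On a real system, add an age filter such as -mtime +7 so only files
# older than seven days are removed.
```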

Diagnosing File Count Problems

To determine how many files exist in a directory, you can run:

find <directory> -type f | wc -l

To inspect disk usage including directories:

du -sh *

Tools such as dust can also help visualize storage usage across large directory trees.
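To locate which subdirectories hold the most files, the per-directory counts can be combined with a sort. This is a sketch over demo data; on a real system you would point it at the directory under investigation.

```shell
# Sketch: count files in each immediate subdirectory to find the hotspots.
# Demo data is created first so the example is runnable.
top=$(mktemp -d)
mkdir -p "$top/run_01" "$top/run_02"
touch "$top/run_01/a" "$top/run_01/b" "$top/run_02/c"

for d in "$top"/*/; do
    printf '%s %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -rn          # largest file counts first
```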

Archiving Large File Collections

When transferring or storing directories containing many files, it is often more efficient to archive them into a single file before moving them between systems.

Example:

tar czf dataset.tar.gz dataset_directory/

This reduces transfer overhead and simplifies data management.
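Before deleting the originals, it is prudent to confirm the archive holds the expected number of entries. This sketch creates demo data first so it is runnable; the `dataset_directory` name matches the example above.

```shell
# Sketch: archive a directory, then compare file counts between the archive
# and the source before removing the originals. Demo data stands in.
src=$(mktemp -d)/dataset_directory
mkdir -p "$src"
touch "$src/a.dat" "$src/b.dat"

cd "$(dirname "$src")"
tar czf dataset.tar.gz dataset_directory/

# List archive members without extracting; exclude directory entries
# (which end in /) so the count is comparable to find -type f.
tar tzf dataset.tar.gz | grep -cv '/$'   # file entries in the archive
find dataset_directory -type f | wc -l   # files in the source directory
```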

Created by: Andrej Sec