Working With Large Numbers of Files¶
High-performance computing clusters typically use parallel filesystems optimized for large data throughput. These filesystems provide very high bandwidth for reading and writing large datasets, but they can perform poorly when workloads generate very large numbers of small files.
Operations such as opening, listing, or deleting thousands of files place a heavy load on the filesystem's metadata servers. As a result, jobs that create many small files may run significantly slower and can degrade performance for everyone else on the system.
System overload
Workflows that generate millions of small files can overload filesystem metadata servers and severely degrade performance for both your jobs and other users on the system.
Why Many Small Files Are Problematic¶
Each file stored on a filesystem requires metadata operations such as:
- file creation
- file lookup
- permission checks
- directory updates
When millions of small files are created, these metadata operations become the dominant cost of the workflow rather than the actual data transfer.
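You can observe this effect with a rough, hedged experiment (the file counts and sizes below are arbitrary, and absolute timings depend on the filesystem):

```bash
# Create 10,000 files of 1 KB each; the runtime is dominated by
# metadata operations (create, open, close), not by the 10 MB of data.
time bash -c 'for i in $(seq 1 10000); do head -c 1024 /dev/zero > "small_$i.dat"; done'

# Write the same 10 MB of data as a single file.
time dd if=/dev/zero of=large.dat bs=1024 count=10240
```

On a parallel filesystem the first command typically takes far longer than the second, even though both write the same volume of data.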
Typical symptoms include:
- slow `ls` or `du` commands
- slow job startup times
- long file deletion times
- reduced I/O performance
Common Situations That Produce Many Files¶
Workflows that often generate large numbers of files include:
- parameter sweeps producing one output file per job
- molecular dynamics trajectories written as many small snapshots
- logging systems producing one file per process
- temporary intermediate files created by data processing pipelines
Recommended Practices¶
Aggregate Files When Possible¶
Instead of producing thousands of individual files, combine outputs into larger containers.
For example, compress a directory into a single archive:
tar czf results_archive.tar.gz results_directory/
Aggregating files reduces metadata operations and improves filesystem performance.
Tip
Archiving directories into a single file can significantly reduce filesystem metadata load and improve transfer performance.
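To work with such an archive later, you can list its contents or extract it again (using the archive name from the example above):

```bash
tar tzf results_archive.tar.gz   # list the contents without extracting
tar xzf results_archive.tar.gz   # extract into the current directory
```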
Use Application-Level Containers¶
Many scientific applications support storing multiple outputs in a single file format, such as:
- HDF5
- NetCDF
- SQLite databases
Containers
Container formats allow many related datasets to be stored within a single file, significantly reducing filesystem metadata overhead and improving I/O efficiency.
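As an illustration, the `sqlite3` command-line shell can consolidate many small per-run outputs into a single database file. This is a minimal sketch, assuming a hypothetical layout of `run_*/output.csv` files that share a header row, and a reasonably recent `sqlite3` (the `--skip` option requires SQLite 3.32 or later):

```bash
#!/bin/bash
# Create the table once; the column names are assumptions and must
# match the CSV header of the per-run files.
sqlite3 results.db "CREATE TABLE IF NOT EXISTS results (step INTEGER, energy REAL);"

# Append each run's CSV (skipping its header row) into the one database.
for f in run_*/output.csv; do
    sqlite3 results.db <<EOF
.mode csv
.import --skip 1 $f results
EOF
done
```

The result is a single `results.db` file instead of thousands of small CSVs, which the filesystem handles far more efficiently.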
Stage Data on Scratch During Jobs¶
Workflows generating many temporary files should run on scratch storage whenever possible. Scratch filesystems are optimized for high I/O workloads and temporary data.
/scratch usage
Use /scratch for intermediate files produced during job execution. Move only the final results to persistent storage.
After the job completes, important results should be archived or moved to persistent storage.
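A minimal job-script sketch, assuming a Slurm-style scheduler and a per-user area under /scratch (the exact path layout is site-specific):

```bash
#!/bin/bash
# Stage all temporary files on scratch; $SLURM_JOB_ID is a standard
# Slurm variable, and the /scratch path layout is an assumption.
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cd "$WORKDIR" || exit 1

# ... run the application here; its many temporary files land in $WORKDIR ...

# Keep only the final results: archive them back to persistent storage
# (final_results/ is a hypothetical output directory), then clean up.
tar czf "$HOME/results_${SLURM_JOB_ID}.tar.gz" final_results/
cd "$HOME" && rm -rf "$WORKDIR"
```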
Avoid Extremely Large Directories¶
Directories containing very large numbers of entries may become slow to access.
Instead of storing all files in one directory, organize files into a hierarchical structure. For example:
data/
    run_01/
    run_02/
    run_03/
Note
Splitting files across multiple directories reduces lookup overhead and improves filesystem responsiveness.
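One way to do this after the fact is to bucket existing files into subdirectories by a name prefix. A minimal sketch, assuming flat `data/*.dat` files and a two-character bucketing scheme (both are assumptions):

```bash
#!/bin/bash
for f in data/*.dat; do
    bucket=$(basename "$f" | cut -c1-2)   # first two characters of the file name
    mkdir -p "data/$bucket"
    mv "$f" "data/$bucket/"
done
```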
Remove Temporary Files¶
Temporary files should be removed once they are no longer needed. This prevents accumulation of unused data and helps maintain good filesystem performance.
File cleanup
Cleaning temporary files regularly helps keep scratch storage available for all users.
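For example, `find` can locate and remove stale temporary files. Preview the matches first, then delete (the path, the `*.tmp` pattern, and the 7-day age are all assumptions):

```bash
# Preview which files would be removed.
find /scratch/$USER -type f -name '*.tmp' -mtime +7

# Remove them once the list looks right.
find /scratch/$USER -type f -name '*.tmp' -mtime +7 -delete
```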
Diagnosing File Count Problems¶
To determine how many files exist in a directory, you can run:
find <directory> -type f | wc -l
To summarize the disk usage of each entry in the current directory, including subdirectories:
du -sh *
Tools such as `dust` can also help visualize storage usage across large directory trees.
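To locate which subdirectories hold the most files, a small loop over `find` can rank them (run it from the directory you want to inspect):

```bash
# Count files in each immediate subdirectory, largest counts first.
for d in */; do
    printf '%8d %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -rn
```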
Archiving Large File Collections¶
When transferring or storing directories containing many files, it is often more efficient to archive them into a single file before moving them between systems.
Example:
tar czf dataset.tar.gz dataset_directory/
This reduces transfer overhead and simplifies data management.
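For instance, transferring one archive and unpacking it on the destination avoids per-file transfer overhead (the host and path names below are placeholders):

```bash
scp dataset.tar.gz user@remote-host:/path/to/destination/
ssh user@remote-host 'cd /path/to/destination && tar xzf dataset.tar.gz'
```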