Let's assume I have the following requirements for writing monitor files from a simulation:
- A large number of individual files has to be written, typically on the order of 10,000
- The files must be human-readable, i.e. formatted I/O
- Periodically, a new line is added to each file, typically every 50 seconds
- The new data has to be accessible almost instantly, so large manual write buffers are not an option
- We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
I was not the one who formulated the requirements, so unfortunately there is no point in discussing them; I would just like to find the best possible solution given the above prerequisites. I put together a small working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest
  use types       ! user module providing the kind parameters I4B, I8B, DP
  use omp_lib
  implicit none

  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B), PARAMETER :: cachesize = 10
  INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
  REAL(DP) :: telapsed, telapsed_global
  REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
  INTEGER(I4B) :: n, t, unitnumber, c, i, thread
  CHARACTER(LEN=100) :: dummy_char, number
  REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real

  call system_clock(counti_global,count_rate)

  ! allocate cache
  allocate(writecache_real(5,cachesize,monitors))
  writecache_real = 0.0_dp

  ! fill values
  allocate(density(steps,monitors), pressure(steps,monitors), vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
  do n=1, monitors
    do t=1, steps
      call random_number(density(t,n))
      call random_number(pressure(t,n))
      call random_number(vel_x(t,n))
      call random_number(vel_y(t,n))
      call random_number(vel_z(t,n))
    end do
  end do

  ! create files
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
    close(20)
  end do

  call system_clock(counti)

  ! write data
  c = 0
  do t=1, steps
    ! gather the current step into the manual write buffer
    c = c + 1
    do n=1, monitors
      writecache_real(1,c,n) = density(t,n)
      writecache_real(2,c,n) = pressure(t,n)
      writecache_real(3,c,n) = vel_x(t,n)
      writecache_real(4,c,n) = vel_y(t,n)
      writecache_real(5,c,n) = vel_z(t,n)
    end do
    ! flush the buffer to disk once it is full or on the last step
    if(c .EQ. cachesize .OR. t .EQ. steps) then
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber,thread)
      thread = OMP_get_thread_num()
      unitnumber = thread + 20
      !$OMP DO
      do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        ! note: buffered='yes' is an Intel Fortran extension, not standard Fortran
        open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', action='write', position='append', buffered='yes')
        write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
        close(unitnumber)
      end do
      !$OMP END DO
      !$OMP END PARALLEL
      c = 0
    end if
  end do

  call system_clock(countf)
  call system_clock(countf_global)
  telapsed=real(countf-counti,kind=dp)/real(count_rate,kind=dp)
  telapsed_global=real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)

  write(*,*)
  write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
  write(*,'(A,F15.6,A)') ' global elapsed wall time: ', telapsed_global, ' seconds'
  write(*,*)

END PROGRAM iotest
Its main features are OpenMP parallelization and a manual write buffer. Here are some timings on the Lustre file system with 16 threads:
- cachesize=5: elapsed wall time for I/O: 991.627404 seconds
- cachesize=10: elapsed wall time for I/O: 415.456265 seconds
- cachesize=20: elapsed wall time for I/O: 93.842964 seconds
- cachesize=50: elapsed wall time for I/O: 79.859099 seconds
- cachesize=100: elapsed wall time for I/O: 23.937832 seconds
- cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference, here are the results on a local workstation HDD with the drive's write cache deactivated, again with 16 threads:
- cachesize=1: elapsed wall time for I/O: 5.543722 seconds
- cachesize=2: elapsed wall time for I/O: 2.791811 seconds
- cachesize=3: elapsed wall time for I/O: 1.752962 seconds
- cachesize=4: elapsed wall time for I/O: 1.630385 seconds
- cachesize=5: elapsed wall time for I/O: 1.174099 seconds
- cachesize=10: elapsed wall time for I/O: 0.700624 seconds
- cachesize=20: elapsed wall time for I/O: 0.433936 seconds
- cachesize=50: elapsed wall time for I/O: 0.425782 seconds
- cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. That would mean the output lags behind, which violates the requirements formulated earlier. Another promising approach was leaving the units open between consecutive writes; a minimal sketch of that variant is shown below. Unfortunately, the number of units that can be open simultaneously is typically limited to 1024-4096 (the per-process file descriptor limit) without root privileges, so this is not an option because the number of files can exceed this limit.
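For reference, this is roughly what that open-units variant looks like (a hypothetical sketch rather than the exact code I tested; it reuses the same types module for the kind parameters and relies on newunit= from Fortran 2008 and the flush statement from Fortran 2003):

PROGRAM iotest_openunits
  use types       ! user module providing the kind parameters I4B, DP
  implicit none
  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B) :: n, t
  INTEGER(I4B), DIMENSION(:), ALLOCATABLE :: units
  REAL(DP), DIMENSION(5) :: sample
  CHARACTER(LEN=100) :: dummy_char, number

  allocate(units(monitors))
  ! open every monitor file once up front; newunit= picks a free unit number for each file
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(newunit=units(n), file=trim(adjustl(dummy_char)), status='replace', action='write')
  end do

  ! per output step: append one formatted line per monitor and flush it immediately
  do t=1, steps
    do n=1, monitors
      call random_number(sample)
      write(units(n),'(5ES25.15)') sample
      flush(units(n))
    end do
  end do

  do n=1, monitors
    close(units(n))
  end do
END PROGRAM iotest_openunits

With this layout every step produces exactly one small append per file and the data is visible right away, but it only works as long as the number of monitors stays below the per-process limit on open files mentioned above.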
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1: From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default was 1M). However, this did not improve I/O performance in my test case. If anyone has additional hints on more suitable file system settings, please let me know.