Let's assume I have the following requirements for writing monitor files from a simulation:
- A large number of individual files has to be written, typically on the order of 10,000
- The files must be human-readable, i.e. formatted I/O
- Periodically, a new line is added to each file, typically every 50 seconds
- The new data has to be accessible almost instantly, so large manual write buffers are not an option
- We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
I was not the one who formulated the requirements, so unfortunately there is no point in discussing them; I would just like to find the best possible solution given the above prerequisites. I put together a small working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest
  use types       ! user module providing the kind parameters I4B, I8B, DP
  use omp_lib
  implicit none

  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B), PARAMETER :: cachesize = 10
  INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
  REAL(DP) :: telapsed, telapsed_global
  REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
  INTEGER(I4B) :: n, t, unitnumber, c, i, thread
  CHARACTER(LEN=100) :: dummy_char, number
  REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real

  call system_clock(counti_global,count_rate)

  ! allocate cache
  allocate(writecache_real(5,cachesize,monitors))
  writecache_real = 0.0_dp

  ! fill values
  allocate(density(steps,monitors), pressure(steps,monitors), vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
  do n=1, monitors
    do t=1, steps
      call random_number(density(t,n))
      call random_number(pressure(t,n))
      call random_number(vel_x(t,n))
      call random_number(vel_y(t,n))
      call random_number(vel_z(t,n))
    end do
  end do

  ! create files
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
    close(20)
  end do

  call system_clock(counti)

  ! write data
  c = 0
  do t=1, steps
    ! gather the current step into the manual write buffer
    c = c + 1
    do n=1, monitors
      writecache_real(1,c,n) = density(t,n)
      writecache_real(2,c,n) = pressure(t,n)
      writecache_real(3,c,n) = vel_x(t,n)
      writecache_real(4,c,n) = vel_y(t,n)
      writecache_real(5,c,n) = vel_z(t,n)
    end do
    ! flush the buffer to disk once it is full or on the last step
    if(c .EQ. cachesize .OR. t .EQ. steps) then
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber,thread)
      thread = OMP_get_thread_num()
      unitnumber = thread + 20
      !$OMP DO
      do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        ! note: buffered='yes' is an Intel Fortran extension, not standard Fortran
        open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', action='write', position='append', buffered='yes')
        write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
        close(unitnumber)
      end do
      !$OMP END DO
      !$OMP END PARALLEL
      c = 0
    end if
  end do

  call system_clock(countf)
  call system_clock(countf_global)
  telapsed=real(countf-counti,kind=dp)/real(count_rate,kind=dp)
  telapsed_global=real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)

  write(*,*)
  write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
  write(*,'(A,F15.6,A)') ' global elapsed wall time: ', telapsed_global, ' seconds'
  write(*,*)

END PROGRAM iotest
Its main features are OpenMP parallelization and a manual write buffer. Here are some timings on the Lustre file system with 16 threads:
- cachesize=5: elapsed wall time for I/O: 991.627404 seconds
- cachesize=10: elapsed wall time for I/O: 415.456265 seconds
- cachesize=20: elapsed wall time for I/O: 93.842964 seconds
- cachesize=50: elapsed wall time for I/O: 79.859099 seconds
- cachesize=100: elapsed wall time for I/O: 23.937832 seconds
- cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference, here are the results on a local workstation HDD with the drive's write cache deactivated, again with 16 threads:
- cachesize=1: elapsed wall time for I/O: 5.543722 seconds
- cachesize=2: elapsed wall time for I/O: 2.791811 seconds
- cachesize=3: elapsed wall time for I/O: 1.752962 seconds
- cachesize=4: elapsed wall time for I/O: 1.630385 seconds
- cachesize=5: elapsed wall time for I/O: 1.174099 seconds
- cachesize=10: elapsed wall time for I/O: 0.700624 seconds
- cachesize=20: elapsed wall time for I/O: 0.433936 seconds
- cachesize=50: elapsed wall time for I/O: 0.425782 seconds
- cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. That would mean the output lags behind, which violates the requirements formulated earlier. Another promising approach was leaving the units open between consecutive writes; a minimal sketch of that variant is shown below. Unfortunately, the number of units that can be open simultaneously is typically limited to 1024-4096 (the per-process file descriptor limit) without root privileges, so this is not an option because the number of files can exceed this limit.
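For reference, this is roughly what that open-units variant looks like (a hypothetical sketch rather than the exact code I tested; it reuses the same types module for the kind parameters and relies on newunit= from Fortran 2008 and the flush statement from Fortran 2003):

PROGRAM iotest_openunits
  use types       ! user module providing the kind parameters I4B, DP
  implicit none
  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B) :: n, t
  INTEGER(I4B), DIMENSION(:), ALLOCATABLE :: units
  REAL(DP), DIMENSION(5) :: sample
  CHARACTER(LEN=100) :: dummy_char, number

  allocate(units(monitors))
  ! open every monitor file once up front; newunit= picks a free unit number for each file
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(newunit=units(n), file=trim(adjustl(dummy_char)), status='replace', action='write')
  end do

  ! per output step: append one formatted line per monitor and flush it immediately
  do t=1, steps
    do n=1, monitors
      call random_number(sample)
      write(units(n),'(5ES25.15)') sample
      flush(units(n))
    end do
  end do

  do n=1, monitors
    close(units(n))
  end do
END PROGRAM iotest_openunits

With this layout every step produces exactly one small append per file and the data is visible right away, but it only works as long as the number of monitors stays below the per-process limit on open files mentioned above.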
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1: From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default was 1M). However, this did not improve I/O performance in my test case. If anyone has additional hints on more suitable file system settings, please let me know.