
The use and effects of the O_SYNC and O_DIRECT flags are very confusing and appear to vary somewhat among platforms. From the Linux man page (see an example here), O_DIRECT provides synchronous I/O, minimizes cache effects, and requires you to handle block-size alignment yourself. O_SYNC just guarantees synchronous I/O. Although both guarantee that data is written into the hard disk's cache, I believe that direct I/O operations are supposed to be faster than plain synchronous I/O, since they bypass the page cache (though FreeBSD's man page for open(2) states that the cache is bypassed when O_SYNC is used; see here).

What exactly are the differences between the O_DIRECT and O_SYNC flags? Some implementations suggest using O_SYNC | O_DIRECT. Why?
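
For concreteness, here is a minimal sketch of how I understand the two flags being passed to open(2) (Linux-flavoured; the file names are made up and error handling is elided):

```c
#define _GNU_SOURCE     /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Synchronous I/O through the page cache: write(2) returns only after
       the data has been handed off to the device. */
    int fd_sync = open("sync.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);

    /* Direct I/O: bypasses the page cache; buffer addresses, sizes and
       file offsets must satisfy the file system's alignment rules. */
    int fd_direct = open("direct.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    if (fd_sync >= 0)
        close(fd_sync);
    if (fd_direct >= 0)
        close(fd_direct);
    return 0;
}
```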

tshepang

4 Answers


O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (direct memory access), if possible. The data does not go into the kernel's caches. There is no strict guarantee that the function will return only after all data has been transferred.

O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't somewhere in the hard disk's write cache, but it is as much as the OS can guarantee.

O_DIRECT|O_SYNC is the combination of these, i.e. "DMA + guarantee".
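
A minimal sketch of that combination, assuming a 512-byte alignment requirement (the file name is made up; check your file system's actual requirements):

```c
#define _GNU_SOURCE     /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "DMA + guarantee": bypass the page cache and block until the
       transfer has completed as far as the OS can tell. */
    int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)
        return 1;

    /* O_DIRECT requires an aligned buffer; posix_memalign provides one. */
    void *buf;
    if (posix_memalign(&buf, 512, 512) != 0) {
        close(fd);
        return 1;
    }
    memset(buf, 'x', 512);

    ssize_t n = write(fd, buf, 512);   /* returns only after the transfer */

    free(buf);
    close(fd);
    return n == 512 ? 0 : 1;
}
```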

Damon
  • This answer is wrong regarding O_SYNC. It does guarantee that the data has been transferred to the medium. The kernel will set the FUA (Force Unit Access) flag on the write if available, or it will send a separate command to flush the write cache. – Paolo Bonzini Jul 10 '16 at 18:04
  • @PaoloBonzini : O_SYNC including FUA behaviour depends upon the operating system - e.g. back in 2013 Linux did and FreeBSD didn't (see Christoph's answer over on http://serverfault.com/a/585427/303019 ) – Anon Jul 19 '16 at 05:46
  • @Anon: That'd be a bug in FreeBSD. – Paolo Bonzini Jul 20 '16 at 06:49
  • @PaoloBonzini: The answer is not wrong, although FUA may (incidentally) happen to be the case with Linux. POSIX says _"shall complete as defined by synchronized I/O file integrity completion"_, and that is defined as: _"succeeds when the data specified in the write request is successfully transferred"_. Transferred and on disk are not the same. Transferred means the disk controller got the command, nothing more. A particular OS may do something different, but POSIX doesn't guarantee that. No, it is not a bug in FreeBSD to do something different. – Damon Dec 07 '17 at 10:18
  • @Damon: "successfully transferred" is defined for writes as ensuring "that all data written is readable on any subsequent open of the file (even one that follows a system **or power** failure) in the absence of a failure of the physical storage medium." (http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html, emphasis mine). So FreeBSD _does_ have a bug—but even if it didn't, its O_SYNC implementation would be completely useless except on disks with a non-volatile cache. – Paolo Bonzini Dec 08 '17 at 10:42
  • Modern enterprise SSDs have power-off data protection for the write cache. Linux offers the mount option `nobarrier` for this case. – stark Sep 05 '18 at 12:49
  • @stark `barrier`/`nobarrier` for filesystems has been deprecated for a while. In the [2.6.37 kernel the block layer dropped `barrier` semantics](https://kernelnewbies.org/Linux_2_6_37#Block) (see the ["block, fs: replace HARDBARRIER with FLUSH/FUA"](https://lore.kernel.org/lkml/1282751267-3530-1-git-send-email-tj@kernel.org/) thread for patches that convert filesystems to use flushing). [XFS no longer accepts `barrier` in the 4.19 kernel](https://github.com/torvalds/linux/commit/1c02d502c20809a2a5f71ec16a930a61ed779b81). A disk/controller can choose to ignore a flush if it protects data so... – Anon Aug 29 '19 at 06:02
  • @Damon Do you think your answer could be subtly changed to indicate that the cache being bypassed by `O_DIRECT` is (only) the one for the kernel (i.e. with `O_DIRECT` alone write data could be only in a **disk's** volatile cache) and that `O_SYNC` (is supposed) to guarantee that it's reached non-volatile storage? I'm basing this on https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics#Allocating_writes , https://lwn.net/Articles/457667/ and https://github.com/zfsonlinux/zfs/issues/224#issuecomment-124153060 (obviously this info is Linux-centric). – Anon Aug 29 '19 at 06:26
  • @Anon: That would be a valid thing to say, yes. `O_DIRECT` is not truly a standard, it's _something something_ no copying, no caching, _something_. In theory, at least. `O_SYNC` on the other hand is exactly specified and gives a mostly well-defined guarantee (_"synchronized I/O file integrity completion"_), at least in theory. In practice, `O_DIRECT` kinda guarantees "on disk" since modern storage devices usually have enough residual power to finish a write anyway, plus you normally have UPS, too. So even if power fails, what hits the controller is pretty safe. Unless, well, stuff like... – Damon Aug 29 '19 at 10:40
  • ... first allocating new blocks and the like, changing inodes, blah. Then of course, `O_DIRECT` is much less "direct" than you would think, and in no way reliable or guaranteed. `O_SYNC` on the other hand does give a guarantee (kind of, to the best of its ability) since it only returns when everything is done (to the best of its knowledge). I find `O_DIRECT` pretty useless anyway, since it never really does anything good. Bypassing cache sounds like a very intelligent thing to do, but it's actually quite stupid. It's a serious anti-optimization. – Damon Aug 29 '19 at 10:41
  • @Damon I wouldn't go so far as to say it's pretty useless (but you did use a qualifier so I know you know :-) but I agree its name falsely implies "turbo mode" when it only helps in niche scenarios. It's useful when kernel caching is absolutely not helping you but if you're trying to use it for speed you must add the requirement that your disk is _so_ fast that your CPU is struggling (relatively) to keep up AND you are able to line all your ducks up in a row (alignment, enough/"big enough" I/Os, fully provisioned etc). E.g. I saw a benefit in https://stackoverflow.com/a/48973798/2732969 . – Anon Aug 31 '19 at 08:26

Please see this LWN article for a clear description of the roles of O_DIRECT and O_SYNC and their impact on data integrity:

https://lwn.net/Articles/457667/

jmoyer
  • Thanks a lot, @jmoyer, this link is actually very helpful and self-explanatory. – SRG Sep 04 '20 at 06:54

Actually, under Linux 2.6 O_DIRECT is synchronous. The open(2) man page has two sections about it.

Under 2.4 it is not guaranteed:

O_DIRECT (Since Linux 2.4.10) Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

A semantically similar (but deprecated) interface for block devices is described in raw(8).

but under 2.6 it is guaranteed; see:

O_DIRECT

The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).

Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.

O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).

The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.

O_DIRECT support was added under Linux in kernel version 2.4.10. Older Linux kernels simply ignore this flag. Some file systems may not implement the flag and open() will fail with EINVAL if it is used.

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the file system correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.

The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.

In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."---Linus

HardySimpson

AFAIK, O_DIRECT bypasses the page cache, while O_SYNC uses the page cache but syncs it immediately. The page cache is shared between processes, so another process working on the same file without the O_DIRECT flag can read the correct data.
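
A minimal sketch of that point, assuming a local file system (the file name is made up): the O_SYNC writer still goes through the shared page cache, so a plain buffered reader sees the data immediately.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int wfd = open("shared.txt", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    int rfd = open("shared.txt", O_RDONLY);
    if (wfd < 0 || rfd < 0)
        return 1;

    const char msg[] = "hello";
    write(wfd, msg, sizeof msg);   /* returns after the data reaches the device */

    char buf[sizeof msg];
    read(rfd, buf, sizeof buf);    /* served from the shared page cache */
    printf("read back: %s\n", buf);

    close(wfd);
    close(rfd);
    return 0;
}
```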

Rumple Stiltskin