How can I compare two files in golang?

Question

In Python I can do the following:

equals = filecmp.cmp(file_old, file_new)

Is there a built-in function to do that in Go? I googled it without success.

I could use a hash function from the hash/crc32 package, but that is more work than the Python code above.

Can you clarify the question? It's asking for two different things (a replacement for `filecmp.cmp` and a way to see if two files contain the same bytes). — Paul Hankin, Apr 09 '15 at 03:24
Sure. I wrote a diff tool in Python (to teach myself Python) that creates patches by comparing files, using the filecmp.cmp function to compare the new and the old file. Now I'm writing the same tool in Go and I cannot find a function like that. So my question is whether there is a built-in function to compare files; if there isn't, I had suggested using a hash function or writing a byte-by-byte comparison function. Sorry for my English. – rvillablanca, Apr 09 '15 at 03:34

score 12 · Answer 1 · edited Oct 10 '19 at 11:19

To complete @captncraig's answer: if you want to know whether the two files are the same file, you can use the SameFile(fi1, fi2 FileInfo) function from the os package.

SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical;

Otherwise, if you want to check the files' contents, here is a solution that compares the two files piece by piece, without loading the entire files into memory.

First try: https://play.golang.org/p/NlQZRrW1dT

EDIT: Read in chunks of bytes and fail fast if the files do not have the same size. https://play.golang.org/p/YyYWuCRJXV

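import (
    "bytes"
    "io"
    "log"
    "os"
)

const chunkSize = 64000

func deepCompare(file1, file2 string) bool {
    // Check file size ...

    f1, err := os.Open(file1)
    if err != nil {
        log.Fatal(err)
    }
    defer f1.Close()

    f2, err := os.Open(file2)
    if err != nil {
        log.Fatal(err)
    }
    defer f2.Close()

    // Allocate the buffers once and reuse them for every chunk.
    b1 := make([]byte, chunkSize)
    b2 := make([]byte, chunkSize)

    for {
        // io.ReadFull keeps reading until the buffer is full or the file
        // ends, so a short read cannot put the two files out of step.
        n1, err1 := io.ReadFull(f1, b1)
        n2, err2 := io.ReadFull(f2, b2)

        // io.EOF means no bytes were read; io.ErrUnexpectedEOF means the
        // file ended before the buffer was full. Both mark the last chunk.
        end1 := err1 == io.EOF || err1 == io.ErrUnexpectedEOF
        end2 := err2 == io.EOF || err2 == io.ErrUnexpectedEOF

        if (err1 != nil && !end1) || (err2 != nil && !end2) {
            log.Fatal(err1, err2)
        }

        // Compare only the bytes actually read; chunks of different
        // lengths are never equal.
        if !bytes.Equal(b1[:n1], b2[:n2]) {
            return false
        }

        if end1 && end2 {
            return true
        }
    }
}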

edited Oct 10 '19 at 11:19

HClx

answered May 04 '15 at 19:39

Pith

Why the overhead of a scanner? That needs to parse the bytes looking for line separators which you don't care about. It also may not do what you expect for binary files. You can just read "chunks" into a pair of reasonably sized buffers and use `bytes.Equal` as you go (which is what @captncraig suggests). – Dave C May 04 '15 at 19:42
BTW, it definitely won't work for binary files without frequent enough 0x0A bytes: "Scanning stops unrecoverably at EOF, the first I/O error, or **a token too large to fit in the buffer**." (From [bufio.Scanner](https://golang.org/pkg/bufio/#Scanner)). – Dave C May 04 '15 at 19:58
Thanks for your feedback. I edited my answer to follow your advice. Do you have an idea of a good default chunk size? – Pith May 04 '15 at 20:15
4k, 8k, 64k, or 128k are likely choices for "real" code reading from files but anything is fine as an example. In general with an `io.Reader` you'd also have to handle short reads (or use `io.ReadFull` and deal with `io.ErrUnexpectedEOF`); `os.File` doesn't seem to guarantee it won't give a short read. All the corner cases start to get annoying :(. Probably not worth dealing with in an SO example, however. – Dave C May 04 '15 at 21:41
Readers are allowed to return a partially filled buffer even if more data will be available later as the docs say `If some data is available but not len(p) bytes, Read conventionally returns what is available instead of waiting for more.` So here `f1` and `f2` could get out of sync while reading. – mat007 Sep 16 '20 at 12:26

captncraig · Accepted Answer · 2015-04-08T05:09:39.967

I am not sure that function does what you think it does. From the docs,

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

Your call is comparing only the signature of os.stat, which only includes:

File mode
Modified Time
Size

You can learn all three of these things in Go from the os.Stat function. This really would only indicate that they are literally the same file, or symlinks to the same file, or a copy of that file.
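
As a rough sketch of that stat-based check in Go: the helper names shallowEqual and writeTemp are made up for illustration, and comparing exactly these three fields is an assumption based on the list above, not a port of Python's implementation.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// shallowEqual is a hypothetical helper. It reports whether two paths have
// the same mode, size, and modification time according to os.Stat.
// It never reads the file contents.
func shallowEqual(path1, path2 string) (bool, error) {
	fi1, err := os.Stat(path1)
	if err != nil {
		return false, err
	}
	fi2, err := os.Stat(path2)
	if err != nil {
		return false, err
	}
	same := fi1.Mode() == fi2.Mode() &&
		fi1.Size() == fi2.Size() &&
		fi1.ModTime().Equal(fi2.ModTime())
	return same, nil
}

// writeTemp is demo scaffolding: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "shallow-demo-*")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		log.Fatal(err)
	}
	return f.Name()
}

func main() {
	p := writeTemp("hello")
	q := writeTemp("hello, world")
	defer os.Remove(p)
	defer os.Remove(q)

	// Compare p with itself, then with a file of a different size.
	for _, other := range []string{p, q} {
		same, err := shallowEqual(p, other)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(same) // true for p vs p, false for p vs q
	}
}
```

A true result here only says the stat signatures match; two different files can still share mode, size, and modification time.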

If you want to go deeper you can open both files and compare them (the Python version reads 8k at a time).

You could use a CRC or MD5 to hash both files, but if there are differences at the beginning of a long file, you want to stop early. I would recommend reading some number of bytes at a time from each reader and comparing with bytes.Compare.
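
For reference, the hashing route could look roughly like the sketch below. It uses hash/crc32 from the standard library; the helper names fileCRC32, sameCRC32, and writeTemp are invented for this example. It illustrates the drawback mentioned above: both files are always read to the end, and equal checksums only make equal contents very likely, not certain.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"io"
	"log"
	"os"
)

// fileCRC32 streams a file through a CRC-32 (IEEE) hash, so memory use
// stays constant regardless of file size.
func fileCRC32(path string) (uint32, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	h := crc32.NewIEEE()
	if _, err := io.Copy(h, f); err != nil {
		return 0, err
	}
	return h.Sum32(), nil
}

// sameCRC32 is a hypothetical helper that compares the two checksums.
func sameCRC32(path1, path2 string) (bool, error) {
	sum1, err := fileCRC32(path1)
	if err != nil {
		return false, err
	}
	sum2, err := fileCRC32(path2)
	if err != nil {
		return false, err
	}
	return sum1 == sum2, nil
}

// writeTemp is demo scaffolding: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "crc-demo-*")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		log.Fatal(err)
	}
	return f.Name()
}

func main() {
	a := writeTemp("hello world")
	b := writeTemp("hello world")
	c := writeTemp("hello World") // differs from a in a single byte
	defer os.Remove(a)
	defer os.Remove(b)
	defer os.Remove(c)

	for _, other := range []string{b, c} {
		same, err := sameCRC32(a, other)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(same) // true for a vs b, false for a vs c
	}
}
```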

score 7 · Answer 3 · edited Nov 18 '16 at 18:41

How about using bytes.Equal?

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
)

func main() {
    // per comment, better to not read an entire file into memory
    // this is simply a trivial example.
    f1, err1 := ioutil.ReadFile("lines1.txt")

    if err1 != nil {
        log.Fatal(err1)
    }

    f2, err2 := ioutil.ReadFile("lines2.txt")

    if err2 != nil {
        log.Fatal(err2)
    }

    fmt.Println(bytes.Equal(f1, f2)) // Per comment, this is significantly more performant.
}

edited Nov 18 '16 at 18:41

heemayl

answered Apr 09 '15 at 02:33

chaseadamsio

Two problems with this post. 1. you are encouraging loading all data into memory. 2. DeepEqual uses reflection and is slow. It makes more sense to use bytes.Equal and if such a function did not exist, I would recommend a for loop. – Stephen Weinberg Apr 09 '15 at 02:45
Updated per @StephenWeinberg, 1. good point. 2. bytes.Equal does exist and you're right, it's significantly faster than reflecting, updated code snippet. – chaseadamsio Apr 09 '15 at 12:08
Updated per @Dave C 3. I was "lazy" in this example (I also didn't have a package declaration or a main function, so this code would error if someone copy-pasted it), so I handled the errors and updated any code that wouldn't have compiled and ran. Hope that satisfies your problem with my answer. – chaseadamsio Apr 09 '15 at 12:08
You did not solve problem 1. You are still loading both files completely into memory. You did solve problems 2 and 3. – Stephen Weinberg Apr 09 '15 at 13:37
Sorry, I didn't mean to imply I solved your problem, but made it clear in the comments that it was just a trivial example on the chance that someone copy/pasted the example and had problems. What's an alternative solution you'd like to propose? I'm happy to delete my response if you think it needs to be removed because it's encouraging bad conduct and it's bad enough to be worthy of down votes. – chaseadamsio Apr 09 '15 at 18:41
All `bytes.Equal()` does is: `return string(a) == string(b)`. See https://github.com/golang/go/blob/master/src/bytes/bytes.go – jftuga May 04 '20 at 02:37

score 1 · Answer 4 · answered Nov 18 '16 at 17:31

You can use a package like equalfile

Main API:

func CompareFile(path1, path2 string) (bool, error)

Godoc: https://godoc.org/github.com/udhos/equalfile

Example:

package main

import (
    "fmt"
    "os"

    "github.com/udhos/equalfile"
)

func main() {
    if len(os.Args) != 3 {
        fmt.Printf("usage: equal file1 file2\n")
        os.Exit(2)
    }

    file1 := os.Args[1]
    file2 := os.Args[2]

    equal, err := equalfile.CompareFile(file1, file2)
    if err != nil {
        fmt.Printf("equal: error: %v\n", err)
        os.Exit(3)
    }

    if equal {
        fmt.Println("equal: files match")
        os.Exit(0)
    }

    fmt.Println("equal: files differ")
    os.Exit(1)
}

answered Nov 18 '16 at 17:31

Everton

`const defaultMaxSize = 10000000000 // Only the first 10^10 bytes are compared.` what the hell – youfu Feb 07 '18 at 12:11
This default max size is a protection against a possibly unlimited stream that would cause a never-ending comparison. You can override it by using the option 'Options.MaxSize'. If you have a better strategy for handling infinite streams, please open a pull request. – Everton Feb 07 '18 at 19:34

score 0 · Answer 5 · answered Oct 23 '20 at 10:22

Here's an io.Reader I whipped up. You can call _, err := io.Copy(ioutil.Discard, newCompareReader(a, b)) to get an error if the two streams don't have equal contents. This implementation is optimized for performance by limiting unnecessary data copying.

package main

import (
    "bytes"
    "errors"
    "fmt"
    "io"
)

type compareReader struct {
    a    io.Reader
    b    io.Reader
    bBuf []byte // need buffer for comparing B's data with one that was read from A
}

func newCompareReader(a, b io.Reader) io.Reader {
    return &compareReader{
        a: a,
        b: b,
    }
}

func (c *compareReader) Read(p []byte) (int, error) {
    if c.bBuf == nil {
        // assuming p's len() stays the same, so we can optimize for both of their buffer
        // sizes to be equal
        c.bBuf = make([]byte, len(p))
    }

    // read only as much data as we can fit in both p and bBuf
    readA, errA := c.a.Read(p[0:min(len(p), len(c.bBuf))])
    if readA > 0 {
        // bBuf is guaranteed to have at least readA space
        if _, errB := io.ReadFull(c.b, c.bBuf[0:readA]); errB != nil { // docs: "EOF only if no bytes were read"
            if errB == io.ErrUnexpectedEOF {
                return readA, errors.New("compareReader: A had more data than B")
            } else {
                return readA, fmt.Errorf("compareReader: read error from B: %w", errB)
            }
        }

        if !bytes.Equal(p[0:readA], c.bBuf[0:readA]) {
            return readA, errors.New("compareReader: bytes not equal")
        }
    }
    if errA == io.EOF {
        // in the happy case we expect EOF from B as well. this might be an extraneous call,
        // because we may already have got it from the read above, but it's easier to check here
        readB, errB := c.b.Read(c.bBuf)
        if readB > 0 {
            return readA, errors.New("compareReader: B had more data than A")
        }

        if errB != io.EOF {
            return readA, fmt.Errorf("compareReader: got EOF from A but not from B: %w", errB)
        }
    }

    return readA, errA
}

score 0 · Answer 6 · answered Nov 17 '20 at 04:28

The standard way is to stat them and use os.SameFile.

-- https://groups.google.com/g/golang-nuts/c/G-5D6agvz2Q/m/2jV_6j6LBgAJ

os.SameFile should do roughly the same thing as Python's filecmp.cmp(f1, f2) (i.e. shallow=True, meaning it only compares the file info obtained by stat).

func SameFile(fi1, fi2 FileInfo) bool

SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical; on other systems the decision may be based on the path names. SameFile only applies to results returned by this package's Stat. It returns false in other cases.

But if you actually want to compare the files' contents, you'll have to do it yourself.
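
A minimal sketch of the stat-then-SameFile check follows, with made-up helper names (sameFile, writeTemp). Per the documentation quoted above, this tests whether two paths refer to the same underlying file, so two separate files with identical bytes are reported as different.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// sameFile is a hypothetical wrapper: it stats both paths and passes the
// results to os.SameFile.
func sameFile(path1, path2 string) (bool, error) {
	fi1, err := os.Stat(path1)
	if err != nil {
		return false, err
	}
	fi2, err := os.Stat(path2)
	if err != nil {
		return false, err
	}
	return os.SameFile(fi1, fi2), nil
}

// writeTemp is demo scaffolding: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "samefile-demo-*")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		log.Fatal(err)
	}
	return f.Name()
}

func main() {
	p := writeTemp("hello")
	q := writeTemp("hello") // same bytes, but a separate file
	defer os.Remove(p)
	defer os.Remove(q)

	for _, other := range []string{p, q} {
		same, err := sameFile(p, other)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(same) // true for p vs p, false for p vs q
	}
}
```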

score 0 · Answer 7 · answered Jan 06 '21 at 16:48

After checking the existing answers I whipped up a simple package for comparing arbitrary (finite) io.Reader values, with a convenience function for files: https://github.com/hlubek/readercomp

Example:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/hlubek/readercomp"
)

func main() {
    result, err := readercomp.FilesEqual(os.Args[1], os.Args[2])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
}

Yobert · Answer 8 · 2021-01-02T21:12:42.190

Something like this should do the trick, and should be memory-efficient compared to the other answers. I looked at github.com/udhos/equalfile and it seemed a bit overkill to me. Before you call compare() here, you should make two os.Stat() calls and compare the file sizes as an early-out fast path.

The reason to use this implementation over the other answers is that you don't want to hold the entirety of both files in memory if you don't have to. You can read a chunk from A and from B, compare them, and then continue with the next chunk, one buffer-load from each file at a time, until you are done. You just have to be careful, because a read may return fewer bytes than requested: you may get 50 bytes from A and then 60 bytes from B.

This implementation assumes a Read() call will not return N > 0 (some bytes read) together with an error != nil. This is how os.File behaves, but other implementations of Read, such as net.TCPConn, may behave differently.

import (
  "bytes"
  "errors"
  "fmt"
  "io"
  "os"
)

var errNotSame = errors.New("File contents are different")

func compare(p1, p2 string) error {
    var (
        buf1 [8192]byte
        buf2 [8192]byte
    )

    fh1, err := os.Open(p1)
    if err != nil {
        return err
    }
    defer fh1.Close()

    fh2, err := os.Open(p2)
    if err != nil {
        return err
    }
    defer fh2.Close()

    for {
        n1, err1 := fh1.Read(buf1[:])
        n2, err2 := fh2.Read(buf2[:])

        if err1 == io.EOF && err2 == io.EOF {
            // files are the same!
            return nil
        }
        if err1 == io.EOF || err2 == io.EOF {
            return errNotSame
        }
        if err1 != nil {
            return err1
        }
        if err2 != nil {
            return err2
        }

        // short read on n1
        for n1 < n2 {
            more, err := fh1.Read(buf1[n1:n2])
            if err == io.EOF {
                return errNotSame
            }
            if err != nil {
                return err
            }
            n1 += more
        }
        // short read on n2
        for n2 < n1 {
            more, err := fh2.Read(buf2[n2:n1])
            if err == io.EOF {
                return errNotSame
            }
            if err != nil {
                return err
            }
            n2 += more
        }
        if n1 != n2 {
            // should never happen
            return fmt.Errorf("file compare reads out of sync: %d != %d", n1, n2)
        }

        if bytes.Compare(buf1[:n1], buf2[:n2]) != 0 {
            return errNotSame
        }
    }
}
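
The size pre-check recommended before calling compare() could be sketched as follows. The names sizesDiffer and writeTemp are hypothetical; the idea is simply that files of different sizes cannot have equal contents, so no bytes need to be read in that case.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// sizesDiffer is a hypothetical helper for the early-out path: it stats
// both files and reports whether their sizes differ.
func sizesDiffer(p1, p2 string) (bool, error) {
	fi1, err := os.Stat(p1)
	if err != nil {
		return false, err
	}
	fi2, err := os.Stat(p2)
	if err != nil {
		return false, err
	}
	return fi1.Size() != fi2.Size(), nil
}

// writeTemp is demo scaffolding: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "size-demo-*")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		log.Fatal(err)
	}
	return f.Name()
}

func main() {
	a := writeTemp("abc")
	b := writeTemp("abcd")
	defer os.Remove(a)
	defer os.Remove(b)

	differ, err := sizesDiffer(a, b)
	if err != nil {
		log.Fatal(err)
	}
	if differ {
		fmt.Println("sizes differ, no need to read the contents")
		return
	}
	fmt.Println("sizes match, fall through to the byte-by-byte compare")
}
```

Equal sizes prove nothing on their own, so the chunked comparison is still needed whenever this check returns false.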

This code looks good at first sight but has some issues due to the semantics of [io.Reader](https://golang.org/pkg/io/#Reader), e.g.: 1. If the first call to `Read` returns `io.EOF` and a non-zero count of bytes read - it is not necessarily true that the files are the same for files < 8K. It is allowed that a read that hits EOF can return the error and a non-zero number of bytes read in the same call. So it must be compared anyway. 2. If one of the reads returns `io.EOF` and the other is not, it may not be true that the files differ since one could be a "short read". — Christopher, Dec 21 '20 at 19:19
@Christopher Aha! good catch, though I think most implementations of Read() such as the one in "os" for os.File will never return(n > 0, EOF). They instead return(n > 0, nil), and then on the next call to read they return (0, EOF). But it looks like you're right about TCP connections in the base "net" package-- those may return some bytes, and an error, if I understand the docs correctly. — Yobert, Jan 02 '21 at 21:09
@Christopher I updated the text to make sure to note that caveat. Thanks! — Yobert, Jan 02 '21 at 21:13
