68

Say I have large datasets in R and I just want to know whether two of them are the same. I often need this when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:

df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

So this is what I do to compare them:

table(x == y, useNA = 'ifany')

Which works great when the datasets have no NAs:

> table(df1 == df2, useNA = 'ifany')
TRUE 
  10 

But not so much when they have NAs:

> table(df3 == df4, useNA = 'ifany')
TRUE <NA> 
  11    1 

In this example, it's easy to dismiss the NA as harmless, since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, the result at that position is always NA, no matter what the other dataset holds there.
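
To make the NA behaviour concrete, here is a minimal sketch of an element-wise comparison that counts two NAs in the same position as a match. The helper name `same_or_both_na` is hypothetical (it is not part of base R), and treating NA-versus-NA as equal is an assumption about what "the same" should mean here.

```r
# Any comparison against NA is itself NA, even NA against NA.
NA == 1
NA == NA

# Hypothetical helper (not part of base R): two NAs in the same position
# count as a match; an NA against a real value counts as a mismatch.
same_or_both_na <- function(a, b) {
  both_na <- is.na(a) & is.na(b)
  equal   <- !is.na(a) & !is.na(b) & a == b
  both_na | equal
}

x <- c(1, NA, 3)
y <- c(1, NA, 4)
same_or_both_na(x, y)            # TRUE TRUE FALSE

# The same idea applied to the data frames from the question.
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
all(same_or_both_na(df3, df4))   # TRUE
```

This only compares values cell by cell; it assumes both objects have the same dimensions and says nothing about column types or attributes.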

So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?

P.S.: Note this is not a duplicate of R - comparing several datasets, Comparing 2 datasets in R or Compare datasets in R

Waldir Leoncio
  • 31
    `identical(df1,df2)` – Metrics Oct 01 '13 at 15:08
  • @Frank, I believe the solutions are common and the problems are roughly the same (let's not get into semantics about the difference between a matrix and a data frame). However, to help future searches, I believe both Qs should be kept. BTW, your link targets this same page; here's the URL to that other question: http://stackoverflow.com/questions/11767851/regarding-matrix-comparison-in-r – Waldir Leoncio Oct 01 '13 at 17:53
  • Oh, oops. Thanks for getting the right link. I don't think there's anything wrong with having dupes, but it might be better to link them together (if/when they actually are dupes), based on what I've browsed on meta and the blog. Just a thought. – Frank Oct 01 '13 at 18:08
  • @Frank, I agree wholeheartedly, but how can we link them without marking one as a dupe of the other? – Waldir Leoncio Oct 01 '13 at 18:11
  • 3
    Yeah, I meant that we could mark this as a dupe, just because it came later. You have an answer, so I figured you wouldn't mind. If you agree, you could flag it for closure as a dupe or I could start a vote. (None have been started.) – Frank Oct 01 '13 at 18:37
  • 3
    @Frank: all right, I'll do it. It's harakiri time! – Waldir Leoncio Oct 01 '13 at 19:11
  • 3
    `dplyr::all_equal()` has arguments for ignoring column and row order, and for converting classes from factor to character and integer to double. – sbha Jul 17 '18 at 21:21

2 Answers

88

Look up all.equal(). It has some caveats, but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
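
As an illustration of the caveats, the sketch below shows two behaviours of `all.equal()` that are easy to trip over: it returns a character vector rather than `FALSE` when the objects differ, and it compares numbers within a tolerance. The data frames `a` and `b` are made up for this example.

```r
df1 <- data.frame(num = 1:5, let = letters[1:5])
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])

# When the objects differ, all.equal() describes the differences
# instead of returning FALSE, so wrap it in isTRUE() inside conditions.
res <- all.equal(df1, df3)
is.character(res)        # TRUE
isTRUE(res)              # FALSE

# Numeric values are compared within a tolerance (about 1.5e-8 by default),
# so tiny floating-point differences still count as equal.
a <- data.frame(x = 1)
b <- data.frame(x = 1 + 1e-10)
isTRUE(all.equal(a, b))  # TRUE
identical(a, b)          # FALSE
```

This tolerance is one plausible reason for `identical()` returning `FALSE` while `isTRUE(all.equal())` returns `TRUE` on the same pair of data frames.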
TheComeOnMan
  • I just got to know this function and will further test it to see if it really works for this particular task, but so far, so good. Thanks! – Waldir Leoncio Oct 01 '13 at 14:55
  • 18
    It's important to note that if the items being compared are NOT equal, then `all.equal` will *not* return `FALSE`. Instead, you have to use `isTRUE( all.equal(df2,df1) )` to get a `TRUE/FALSE` output from `all.equal` – Ricardo Saporta Oct 01 '13 at 16:41
  • 2
    @RicardoSaporta, you're right, but in that case I believe it is better to just go ahead and use `identical()`, as @Metrics suggested above. The thing about `all.equal()` is that it returns a vector "describing the differences between target and current", which can be good or bad depending on what kind of output you're looking for. – Waldir Leoncio Oct 01 '13 at 18:07
  • 9
    `dplyr::all_equal()` is another option. By default it ignores column and row order, and is sensitive to variable classes, but those defaults can be overridden: `dplyr::all_equal(target, current, ignore_col_order = FALSE, ignore_row_order = FALSE, convert = TRUE)` – sbha Jul 04 '18 at 16:11
  • For my two big data frames, `identical(df2,df1)` returns `FALSE` but `isTRUE(all.equal(df2,df1))` returns `TRUE` (as does `all_equal()`). Any idea why? – Dan Chaltiel Jul 24 '18 at 10:45
32

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will return either TRUE or hints about the differences between the objects. For instance, consider the following:

> identical(df1, df3)
[1] FALSE

> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"                                
[2] "Component 1: Numeric: lengths (5, 6) differ"                                                
[3] "Component 2: Lengths: 5, 6"                                                                 
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"   

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
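
A rough way to check both points on your own machine is sketched below. The data frames `big1` and `big2` and their size are made up for illustration, and the timings will vary with hardware and R version, so no numbers are shown.

```r
# identical() has no trouble with NAs: matching NAs count as identical.
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
identical(df3, df4)  # TRUE

# Illustrative timing on larger data. big2 is rebuilt from fresh vectors
# so that it does not simply share memory with big1.
set.seed(1)
n <- 1e5
big1 <- data.frame(a = rnorm(n), b = sample(letters, n, replace = TRUE))
big2 <- data.frame(a = big1$a + 0, b = paste0(big1$b, ""))

system.time(identical(big1, big2))
system.time(isTRUE(all.equal(big1, big2)))
```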

Waldir Leoncio