
I have a large dataset that I'm analyzing in R and I'm interested in one column or vector of information. Each entry in this vector has a varied number (ranging from 1-5) of significant figures, and I want to subset this vector so I'm not seeing data with only one significant digit. What kind of test or function can I use to get R to report the number of significant figures for each entry? I've looked into the signif() function but that is more for rounding data to a specified number of significant digits, not querying how many sig figs are there.

Example: Suppose I have this vector:
4
28.382
120
82.3
100
30.0003

I want to remove the entries that have only one significant digit. That would be entries 1 (value of 4) and 5 (value of 100). I know how to subset data in R, but I don't know how to tell R to "find" all the values with only one significant figure.

pocketlizard
  • This can go haywire in a hurry if you confuse the printed representation of stored floats with the actual stored values. While Roland's solution looks nice, I strongly recommend you convert your actual reported precisions as character strings and work from there. – Carl Witthoft Jan 04 '15 at 17:37

2 Answers

x <- c(4, 28.382, 120, 82.3, 100, 30.0003)
#compare the values with result of signif
#you need to consider floating point precision
keep <- abs(signif(x, 1) - x) > .Machine$double.eps
x[keep]
#[1]  28.3820 120.0000  82.3000  30.0003
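
A possible variation on the code above, offered as a sketch rather than as part of the answer as posted: `.Machine$double.eps` is an absolute tolerance, so it may be too strict for values much larger than 1. The version below scales the tolerance with the size of each value. The name `tol` and its value are assumptions chosen for illustration.

```r
x <- c(4, 28.382, 120, 82.3, 100, 30.0003)

# Assumed relative tolerance; suitable only when the data carry far fewer
# than 8 significant digits, as in the question (1 to 5).
tol <- sqrt(.Machine$double.eps)

# Keep values that differ from their one-significant-digit rounding
# by more than the tolerance scaled by their own magnitude.
keep <- abs(signif(x, 1) - x) > tol * abs(x)
x[keep]
```

For the example vector this keeps the same four values as the absolute-tolerance version.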
Roland
  • Possible bug (somewhat terminological): in the engineering world, the string "100" is considered to have one sigfig while "100.0" has four sigfigs, but your code treats the numeric values the same. Depending on exactly how the OP's values are created (and represented), this may present a problem. Maybe some function which checks for existence of a decimal point and adjusts the checking mechanism appropriately? (I see akrun deleted one such approach; I may revive it in a new answer) – Carl Witthoft Jan 04 '15 at 17:36
  • My dataset is so large and there have been thousands of people taking measurements and adding their entries, I think going more conservative is the safer choice. So even if there is a "100.0" that really was measured with such precision, the answer as written is sufficient. But you are right that this answer doesn't totally work for sigfigs in special cases such as you mentioned. – pocketlizard Jan 05 '15 at 21:54

I think this should be equivalent to Roland's solution.

x <- c(4, 4.0, 4.00, 28.382, 120,
       82.3, 100, 100.0, 30.0003)
x
ifelse(x == signif(x, 1), NA, x)
ifelse(x == signif(x, 2), NA, x)
ifelse(x == signif(x, 3), NA, x)

In any case, it has the same problem: it gives the wrong number of significant digits for cases like "4.00" and "100.0".

Part of the solution, as pointed out above, is to treat the numbers as character strings. Converting the numbers to characters after the fact isn't sufficient, because the trailing zeros are already gone by then; the values have to be read in as strings, which takes a bit of care. The colClasses argument in the read.table group of functions can come in handy.
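
As a minimal sketch of that idea, the snippet below reads a column as text so the trailing zeros survive. The inline data, the name `raw`, and the column name `value` are made up for illustration; with a real file, the `text =` argument would be replaced by the file path.

```r
# Hypothetical file contents, supplied inline via `text =`
raw <- "value
4
4.0
100
100.0
30.0003"

# colClasses = "character" stops read.table from converting the column
# to numeric, so "4.0" and "100.0" keep their trailing zeros.
df <- read.table(text = raw, header = TRUE, colClasses = "character")
df$value
# "4" "4.0" "100" "100.0" "30.0003"
```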

xc <- c("4", "4.0", "4.00", "28.382", "120",
        "82.3", "100", "100.0", "30.0003")
xc
# "4"  "4.0" "4.00" "28.382" "120" "82.3" "100" "100.0" "30.0003"
ifelse(xc == signif(as.numeric(xc), 1), NA, xc)
# NA "4.0" "4.00" "28.382" "120" "82.3" NA "100.0" "30.0003"

Only "4" and "100" are removed. That looks promising, but going a bit further shows that not everything is quite as it ought to be: "4.0" and "4.00" survive even though they have only two and three significant digits.

ifelse(xc == signif(as.numeric(xc), 2), NA, xc)
# NA "4.0" "4.00" "28.382" NA "82.3" NA "100.0" "30.0003"
ifelse(xc == signif(as.numeric(xc), 3), NA, xc)
# NA "4.0" "4.00" "28.382" NA NA NA "100.0" "30.0003"

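The reason is that when a number is compared with a string, R converts the number to a string first. This can be demonstrated like this:

2 == "2"
# TRUE – the number 2 is converted to the string "2" before comparing
2.0 == "2"; 02 == "2"
# TRUE
# TRUE – the numeric literals 2.0 and 02 are both stored as 2, which converts to "2"
2 == "2.0"
# FALSE – the string "2.0" is left as it is, and "2" differs from "2.0"
2 == as.numeric("2.0")
# TRUE – an explicit conversion makes the comparison numeric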

It's also worth keeping in mind that strings are compared in alphanumerical order, even if they can easily be interpreted as numbers.

2 < "2.0"
# TRUE
2 > "2.0"
# FALSE
"2.0" < "2.00"
# TRUE
sort(xc)
# "100" "100.0" "120" "28.382" "30.0003" "4" "4.0" "4.00" "82.3" 
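
To order such strings by numeric value while keeping their original text, one option is to sort on the converted numbers and index the character vector with the result. A small sketch, using the same values as `xc` above (the name `xc_sorted` is introduced here for illustration):

```r
xc <- c("4", "4.0", "4.00", "28.382", "120",
        "82.3", "100", "100.0", "30.0003")

# order() works on the numeric values; indexing xc keeps the strings intact.
# Ties such as "4", "4.0" and "4.00" stay in their original order.
xc_sorted <- xc[order(as.numeric(xc))]
xc_sorted
# "4" "4.0" "4.00" "28.382" "30.0003" "82.3" "100" "100.0" "120"
```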

So far the only complete fix I've found for this problem is a little hacky. It consists of picking out the strings that contain a decimal separator (".") and replacing the last character of those strings with a "1" (or any other non-zero digit). This turns "4.0" into "4.1" but leaves "100" as it is. The new vector is then used as the basis for comparison.

xc.1 <- xc
decimal <- grep(".", xc, fixed=TRUE)
xc.1[decimal] <- gsub(".$", "1", xc[decimal])
xc.1 <- as.numeric(xc.1)

xc
# "4"  "4.0" "4.00" "28.382" "120" "82.3" "100" "100.0" "30.0003"
ifelse(xc.1 == signif(xc.1, 1), NA, xc)
# NA "4.0" "4.00" "28.382" "120" "82.3" NA "100.0" "30.0003"
ifelse(xc.1 == signif(xc.1, 2), NA, xc)
# NA NA "4.00" "28.382" NA "82.3" NA "100.0" "30.0003"
ifelse(xc.1 == signif(xc.1, 3), NA, xc)
# NA NA NA "28.382" NA NA NA "100.0" "30.0003"

If you want to actually count the number of significant digits, that can be done with a small loop.

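n <- 7  # largest number of significant digits to test for

# true counts, based on the adjusted vector xc.1
xc.count <- numeric(length(xc.1))
# count down so that the smallest matching i is the one that sticks
for (i in n:1) xc.count[xc.1 == signif(xc.1, i)] <- i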
xc.count
# 1 2 3 5 2 3 1 4 6

# simple counts, based on the numeric vector x (trailing zeros are lost)
x.count <- numeric(length(x))
for (i in n:1) x.count[x == signif(x, i)] <- i
x.count
# 1 1 1 5 2 3 1 1 6
AkselA