I use the gsub function in R to strip unwanted characters from strings that should contain numbers: every character that is not a digit, `.`, or `-` should be removed. My problem is that my regular expression leaves some non-numeric characters in place, such as d, +, and <.

Below are my regular expression, the gsub execution, and its output. How can I change the regular expression in order to achieve the desired output?

Current output:

gsub(pattern = '[^(-?(\\d*\\.)?\\d+)]', replacement = '', x = c('1.2<', '>4.5', '3+.2', '-1d0', '2aadddab2','1.3h'))
[1] "1.2<"  ">4.5"  "3+.2"  "-1d0"  "2ddd2" "1.3"

Desired output:

[1] "1.2"  "4.5"  "3.2"  "-10"  "22" "1.3"

Thank you.

jair.jr
  • Use `gsub("-", "+", x, fixed=TRUE)` – Wiktor Stribiżew Oct 09 '18 at 13:58
  • Just get rid of the `?` Use `gsub(pattern = '-', replace='+', x = c('a', 'bc', '-'))` – G5W Oct 09 '18 at 13:59
  • It works if you wrap `[]` around what you want to find: `gsub(pattern = '[-?]', replace='+', x = c('a', 'bc', '-'))` – Stanislaus Stadlmann Oct 09 '18 at 13:59
  • not a helpful comment @StanislausStadlmann. Should be `gsub(pattern = '[-]', replace='+', x = c('a', 'bc', '-'))`. He does not intend to match a possible `?` character. – Andre Elrico Oct 09 '18 at 14:02
  • @WiktorStribiżew, the example I provided is a simplification. In my real problem, I need the `?` quantifier, so I cannot use `fixed = TRUE`. – jair.jr Oct 09 '18 at 14:23
  • You know what to do: update the question with the real life scenario. Then let me know. Right now, I see no point in reopening the post. – Wiktor Stribiżew Oct 09 '18 at 14:24
  • `?` means zero or one instance of the preceding character, so the pattern also matches the empty string and your code adds a + basically *everywhere*, whether there is or isn't a -. Instead, you can use either `-+` (one or more -) or `-` (exactly one minus), like so: `gsub(pattern = '-+', replace='+', x = c('a', 'bc', '-'))` – iod Oct 09 '18 at 14:43
  • @jair.jr please provide a good example in order for us to provide a good solution. – Andre Elrico Oct 09 '18 at 14:46
  • @WiktorStribiżew, I updated the question with the real life scenario. Thank you. – jair.jr Oct 09 '18 at 14:57
  • `gsub("[^0-9.-]", "", x)` – Andre Elrico Oct 09 '18 at 14:57
  • The approach I'm taking is: remove everything except the characters listed in `[^THOSE VALUES]`. – Andre Elrico Oct 09 '18 at 15:04
  • As @AndreElrico points out, a caret as the first character in square brackets means you're selecting for everything *except* what's in the square brackets (rather than its normal meaning as the beginning of a string), so what your code actually does is remove everything except the characters `(-?(\\d*\\.)?\\d+)` – iod Oct 09 '18 at 15:33
  • What if the value is `67-.- – Wiktor Stribiżew Oct 09 '18 at 15:46
  • Can there be `5g.g6h.h7hh.8`, `9-9.8.9`? There are now too many unclear moments here. – Wiktor Stribiżew Oct 09 '18 at 15:54
  • If Andre's regex works for you please let the user know, so that an answer could be posted. – Wiktor Stribiżew Oct 09 '18 at 18:33
  • Thank you, @AndreElrico. Your suggestion is working for me. I just have to implement an extra step to parse the resulting expression to a number, and discard the value if there's an error during parse. Just for the record, yes there can be `5g.g6h.h7hh.8, 9-9.8.9` or `67-.- – jair.jr Oct 09 '18 at 18:49

1 Answer

Simply use

gsub("[^0-9.-]", "", x)
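As a quick check, here is that call applied to the vector from the question. The variable names `x` and `cleaned` are only illustrative. Inside the brackets, the leading `^` negates the class, `.` is a literal dot, and the trailing `-` is a literal minus sign, so everything except digits, dots, and minus signs is removed.

```r
# Sample input taken from the question
x <- c('1.2<', '>4.5', '3+.2', '-1d0', '2aadddab2', '1.3h')

# Remove every character that is not a digit, a dot, or a minus sign
cleaned <- gsub("[^0-9.-]", "", x)

print(cleaned)
# [1] "1.2" "4.5" "3.2" "-10" "22"  "1.3"
```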

If a string can contain several `-` or `.` characters, you can handle that with a second regex. If you struggle with it, open a new question.


(Replace `.` with `,` in the character class if your decimal separator is a comma.)
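One possible shape for that second step, sketched under assumptions: after cleaning, keep only strings that look like a single well-formed number and turn the rest into `NA`. The validation pattern and the names `is_number` and `values` are illustrative choices, not part of the original answer. The inputs include the awkward cases raised in the comments.

```r
# Inputs include the awkward cases raised in the comments
x <- c('1.2<', '-1d0', '5g.g6h.h7hh.8', '9-9.8.9', '67-.-')

# Step 1: strip everything except digits, dots, and minus signs
cleaned <- gsub("[^0-9.-]", "", x)

# Step 2 (assumed validation rule): optional leading minus, optional
# integer part followed by a dot, then at least one digit
is_number <- grepl("^-?([0-9]*\\.)?[0-9]+$", cleaned)

# Convert only the well-formed strings; everything else stays NA
values <- rep(NA_real_, length(cleaned))
values[is_number] <- as.numeric(cleaned[is_number])

print(values)
```

Here only the first two elements survive, as 1.2 and -10; the strings that still contain repeated dots or misplaced minus signs after cleaning become `NA` and can be discarded.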

Andre Elrico