-3

I'm trying to figure out how to remove citation numbers from a string in the following way using R:

Original string:

"There were 100 people outside.231 They were sharing 10 hotdogs.42 Nice!"

Desired string:

"There were 100 people outside. They were sharing 10 hotdogs. Nice!"

I'm admittedly very bad with regex. Would anyone have any ideas? Thanks!

honeymoow

2 Answers

2

You can try `(?<=\\.)\\d+`, which uses a lookbehind to match only the digits that immediately follow a period, e.g.,

> s <- "There were 100 people outside.231 They were sharing 10 hotdogs.42 Nice!"
> gsub("(?<=\\.)\\d+", "", s, perl = TRUE)
[1] "There were 100 people outside. They were sharing 10 hotdogs. Nice!"

A more efficient way (thanks to @JvdV's comment) might be to match the period together with the digits and put the period back in the replacement:

gsub("\\.\\d+", ".", s, perl = TRUE)
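
As a quick check, the two patterns can be run side by side on the string from the question. This is only a sketch; the variable names `with_lookbehind` and `with_literal` are illustrative and not part of the original answer.

```r
# Sample string from the question
s <- "There were 100 people outside.231 They were sharing 10 hotdogs.42 Nice!"

# Lookbehind: match only the digits that directly follow a period
with_lookbehind <- gsub("(?<=\\.)\\d+", "", s, perl = TRUE)

# Literal match: consume the period plus the digits, then put the period back
with_literal <- gsub("\\.\\d+", ".", s, perl = TRUE)

# Both approaches should yield the same cleaned string
identical(with_lookbehind, with_literal)
```

Both calls are vectorized, so `s` could just as well be a character vector of several sentences.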
ThomasIsCoding
  • I think the lookbehind creates more overhead than needed. Maybe just `gsub("\\.\\d+", ".", s, perl = TRUE)`? This takes only a few steps (7), whereas the lookbehind takes [100+](https://regex101.com/r/GnYADL/2). – JvdV Jan 16 '21 at 19:33
  • @JvdV Thanks for the suggestion, I added yours into the answer – ThomasIsCoding Jan 16 '21 at 21:19
0

To remove a number after a period (you could make this more elaborate for any edge cases you might have), match a word character followed by a period, capture the digits that follow, and keep only the part before the digits.

The pattern has two groups. The first group is a single word character followed by a period, and the second group is the run of digits after it. The replacement returns only the first group (denoted by \\1), discarding the digits.

> xy <- "There were 100 people outside.231 They were sharing 10 hotdogs.42 Nice!"
> gsub("(\\w\\.)(\\d+)", replacement = "\\1", x = xy, perl = TRUE)
[1] "There were 100 people outside. They were sharing 10 hotdogs. Nice!"
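
One edge case worth checking is a decimal number: `\\w` also matches digits, so something like `3.14` would lose its fractional part. A possible variation, assuming citation numbers always follow a letter and a period, restricts the first group to `[[:alpha:]]`. The sample string `xy2` and the name `cleaned` below are made up for illustration.

```r
# Hypothetical input containing both a decimal and a citation number
xy2 <- "Pi is roughly 3.14 in most textbooks.12 Nice!"

# \\w matches digits too, so the decimal is truncated here
truncated <- gsub("(\\w\\.)(\\d+)", "\\1", xy2, perl = TRUE)
print(truncated)

# Requiring a letter before the period leaves the decimal alone
cleaned <- gsub("([[:alpha:]]\\.)\\d+", "\\1", xy2, perl = TRUE)
print(cleaned)
```

This variation would still miss a citation placed after a sentence that ends in a digit, so the right pattern depends on how the citations appear in the actual text.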
Roman Luštrik