I have this dataframe with some data that I'm trying to extract. I don't actually have a problem, but I feel like there should be a better / more elegant way to do it.

So, I have this string

CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>

136 times, and I'm interested in MUN=(.*) and AGEB=(.*).

To obtain the info I use:

library(stringr)  # provides str_split_fixed()
test1 <- sub(".*_MUN=(.*)<BR>CVE_LOC=0001<BR>CVE_AGEB=(.*)<.*", "\\1_\\2", L1_AGEB$description)
str_split_fixed(test1, "_", 2)

And it works just fine but, like I said, this is just for academic/improvement purposes: is there an easier or more elegant way?

Thank you

René Martínez

4 Answers

1

Definitely take a look at the rex package. It has a learning curve, but it can be pretty nifty:

library(rex)

rex::re_matches("CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>",
                pattern = rex::rex(
                  "MUN=",
                  capture(any_numbers, name = "MUN"),
                  anything,
                  "AGEB=",
                  capture(any_numbers, name = "AGEB")
                ))
  MUN AGEB
1 015 1337
Alexis
1

We can completely parse the entire input by converting the input to DCF format. This has the advantage that any of the fields can easily be subsequently extracted.

Assuming the input x shown in the Note at the end, we can replace <BR> with newline and replace = with colon and then read what is left using read.dcf. No packages are used.

x2 <- gsub("=", ":", gsub("<BR>", "\n", x))
read.dcf(textConnection(x2))

giving this character matrix:

     CVEGEO          CVE_ENT CVE_MUN CVE_LOC CVE_AGEB
[1,] "0901500011337" "09"    "015"   "0001"  "1337"  
[2,] "0901500011337" "09"    "015"   "0001"  "1337"  
[3,] "0901500011337" "09"    "015"   "0001"  "1337"  
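
As a small follow-up sketch to the matrix above: since read.dcf returns a character matrix whose column names are the field names, the two fields asked about in the question can be picked out by name. The variable names con, m and wanted are illustrative only and not part of the original answer.

```r
# Same sample input as in the Note at the end of this answer
x <- "CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>"
x <- rep(x, 3)

# Convert to DCF-style "tag:value" lines, as described above
x2 <- gsub("=", ":", gsub("<BR>", "\n", x))

# Open the connection explicitly so it can be closed afterwards
con <- textConnection(x2)
m <- read.dcf(con)
close(con)

# Keep only the two fields of interest, selected by column name
wanted <- m[, c("CVE_MUN", "CVE_AGEB")]
wanted
```

Selecting by name rather than by position keeps the code working if the order of the fields in the input ever changes.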

A variation of this using the magrittr package would be:

library(magrittr)
x %>%
  gsub("<BR>", "\n", .) %>%
  gsub("=", ":", .) %>%
  textConnection %>%
  read.dcf

Note

x <- "CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>"
x <- rep(x, 3)
G. Grothendieck
0

You may use a regmatches / regexpr approach with a PCRE regex that will extract 1+ digits after the known "prefixes":

x <- "CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>"
regmatches(x, regexpr("_MUN=\\K\\d+", x, perl=TRUE))
## => [1] "015"
regmatches(x, regexpr("_AGEB=\\K\\d+", x, perl=TRUE))
## => [1] "1337"

See the R demo online.

Regex details

  • _MUN= - the literal text _MUN=
  • \K - match reset operator that discards the text matched so far
  • \d+ - 1+ digits.

The use of perl=TRUE is crucial for the regex to work: \K is a PCRE feature that R's default regex engine (TRE) does not support.
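
One thing to watch when applying this to a whole column (the question mentions 136 rows): regmatches combined with regexpr returns only the substrings that matched, so the result can be shorter than the input. A minimal sketch, assuming a hypothetical second element that has no _MUN= field, that keeps the output aligned with the input by filling in NA:

```r
# Second element is a made-up example of a row without the field
x <- c("CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>",
       "no matching field here")

m <- regexpr("_MUN=\\K\\d+", x, perl = TRUE)

# regexpr returns -1 where nothing matched; fill only the matching positions
mun <- rep(NA_character_, length(x))
mun[m != -1] <- regmatches(x, m)
mun
```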

Equivalent using stringr:

library(stringr)
str_extract(x, "(?<=_MUN=)\\d+")
str_extract(x, "(?<=_AGEB=)\\d+")

The (?<=...) positive lookbehind only checks for the pattern match immediately to the left of the current location, but does not consume the text, i.e. does not put it into the match value.

And a fancy solution with stringr::str_match capturing the results in one go into Columns 2 and 3:

library(stringr)
str_match(x, "_MUN=(\\d+).*_AGEB=(\\d+)")
#      [,1]                                        [,2]  [,3]  
# [1,] "_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337" "015" "1337"
Wiktor Stribiżew
0

This answer is inefficient. Here, maybe, we'd just use [0-9] instead of \d, which might perform insignificantly better in terms of time and space complexity (that is only a guess on my part). As you've mentioned, your original expression is just fine; lookarounds and other fancy methods are usually not recommended when working with regular expressions.

MUN=([0-9]+).+AGEB=([0-9]+)
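
For illustration, here is one way this pattern might be applied in base R, using regexec and regmatches to pull out both capture groups at once. This is only a sketch: the sample input is copied from the question, and the names parts, m and res are made up for the example.

```r
x <- "CVEGEO=0901500011337<BR>CVE_ENT=09<BR>CVE_MUN=015<BR>CVE_LOC=0001<BR>CVE_AGEB=1337<BR>"
x <- rep(x, 3)

# regexec records the full match plus each capture group;
# regmatches then returns one character vector per input string
parts <- regmatches(x, regexec("MUN=([0-9]+).+AGEB=([0-9]+)", x))

# Stack into a matrix: column 1 is the full match, columns 2 and 3 the groups
m <- do.call(rbind, parts)
res <- data.frame(MUN = m[, 2], AGEB = m[, 3], stringsAsFactors = FALSE)
res
```

Note that this assumes every element of x matches; a non-matching element yields an empty vector and would need separate handling before the rbind step.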

Demo

I'm pretty sure there are other ways to improve on what we wish to accomplish here, yet the key point is that your original expression is already on a right path, though maybe not the right one, and for that reason we'd likely have traded off the desired elegance.

Please see other views in the comments. I'm only referencing here and don't really have an opinion or recommendation, even though it might sound otherwise.

Emma
  • I would be careful with a 2013 link post. Today, `\d` matches these Unicode ranges in UTF-8/32 regex: `[0-9٠-٩۰-۹߀-߉०-९০-৯੦-੯૦-૯୦-୯௦-௯౦-౯೦-೯൦-൯෦-෯๐-๙໐-໙༠-༩၀-၉႐-႙០-៩᠐-᠙᥆-᥏᧐-᧙᪀-᪉᪐-᪙᭐-᭙᮰-᮹᱀-᱉᱐-᱙꘠-꘩꣐-꣙꤀-꤉꧐-꧙꧰-꧹꩐-꩙꯰-꯹0-9----------------------]` –  Jun 20 '19 at 00:28
  • As far as performance goes, ranges in classes are the slowest, followed by sequences. Speedwise, `\d` is an intrinsic API function `isdigit()`, which takes less time than the others. `Regex1: [0123456789] Matches found per iteration: 62 Elapsed Time: 2.36 s, 2363.61 ms, 2363615 µs Matches per sec: 1,311,550 Regex2: [0-9] Matches found per iteration: 62 Elapsed Time: 2.62 s, 2620.44 ms, 2620444 µs Matches per sec: 1,183,005 Regex3: \d Matches found per iteration: 62 Elapsed Time: 2.20 s, 2196.97 ms, 2196974 µs Matches per sec: 1,411,031` –  Jun 20 '19 at 00:39