split a string knowing some of the substrings

Question

Say I have the following string and a vector of substrings:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

I would like to split this string by extracting the substrings listed in my vector and making new substrings from the characters in between, so that I get the following:

res <- c("abc", "[[", "+", "de.f", "[", "-", "[[", "g")

In case of conflicting matches the longer one wins (here [[ over [). You can assume there won't be conflicting matches of the same length.

Tagging with regex but open to any solution, faster being better.

Please don't make any assumption on the type of character used in any of these strings, apart from the fact they're ASCII. There is no pattern to be inferred if I didn't explicitly mention it.

Another example:

x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
res <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g", "[*", "+-", "h", "-", "+")

Moody_Mudskipper · Answer 1 · 2019-04-11T20:20:48.337

2

Using stringr::str_match_all and Hmisc::escapeRegex:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
tmp <- v[order(-nchar(v))] # sort to have longer first, to match in priority
tmp <- Hmisc::escapeRegex(tmp)
tmp <- paste(tmp,collapse="|")  # compile a match string
pattern <- paste0(tmp,"|(.+?)") # add a pattern to match the rest
# extract all matches into a matrix
mat <- stringr::str_match_all(x, pattern)[[1]]
# aggregate where second column is NA
res <- unname(tapply(mat[,1], 
                     cumsum(is.na(mat[,2])) + c(0,cumsum(abs(diff(is.na(mat[,2]))))),
                     paste, collapse=""))
res
#> [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"
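The grouping index passed to tapply above is the least obvious step, so here is a minimal base R sketch of it on hand-made data. The vectors `pieces` and `is_tok` are illustrative stand-ins (not from the answer) for column 1 of the match matrix and for `is.na(mat[,2])`.

```r
## Toy stand-ins for the two columns of the str_match_all() matrix
pieces <- c("a", "b", "c", "[[", "+", "d")           # column 1: the full match
is_tok <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)   # is.na(column 2): TRUE for items of v

## Every token bumps the id, and so does every switch between token and plain text,
## so consecutive plain characters share an id while each token gets its own
grp <- cumsum(is_tok) + c(0, cumsum(abs(diff(is_tok))))
grp
#> [1] 0 0 0 2 3 4

## Paste together the pieces that share an id
res <- as.vector(tapply(pieces, grp, paste, collapse = ""))
res
## => "abc" "[[" "+" "d"
```

Two adjacent tokens ("[[" then "+") stay separate because the first cumsum increases at each of them, which is the behaviour the question asks for.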

edited Apr 11 '19 at 20:20

answered Apr 11 '19 at 18:38

Moody_Mudskipper


Are you escaping *each* char with `lapply(tmp, function(x) paste0("\\",x,collapse=""))`? That is fraught with issues. – Wiktor Stribiżew Apr 11 '19 at 18:46
made it cleaner using `Hmisc::escapeRegex`, which has similar code as your `regex.escape` – Moody_Mudskipper Apr 11 '19 at 20:24
My `regex.escape` is the best in my opinion, as it only escapes those chars that might need escaping. I do not include spaces and `#` though, as one rarely uses free-spacing mode with R regexes. – Wiktor Stribiżew Apr 12 '19 at 06:43
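A small sketch of why putting a backslash in front of every character is risky. Here `naive.escape` is a hypothetical name for a reconstruction of the approach quoted in the first comment, and `regex.escape` is copied from the accepted answer.

```r
## Hypothetical reconstruction of the "backslash before every char" approach
naive.escape <- function(s) {
  vapply(strsplit(s, ""),
         function(ch) paste0("\\", ch, collapse = ""),
         character(1))
}
## Escaping function from the accepted answer: only touches regex metacharacters
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

naive.escape("d")    # the regex \d, i.e. "any digit", no longer a literal d
regex.escape("d")    # plain d, left untouched

grepl(naive.escape("d"), "d", perl = TRUE)       # FALSE: the letter d is not a digit
grepl(naive.escape("d"), "5", perl = TRUE)       # TRUE: matches a digit instead
grepl(regex.escape("d"), "d", perl = TRUE)       # TRUE
grepl(regex.escape("[["), "a[[b", perl = TRUE)   # TRUE
```

The question only guarantees ASCII input, so letters and digits can appear in `v`, which is where the naive version goes wrong.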

score 2 · Answer 2 · answered Apr 11 '19 at 19:23

This almost seems more like a lexing problem than a matching problem. I seem to get decent results with the minilexer package:

library(minilexer) #devtools::install_github("coolbutuseless/minilexer")

patterns <- c(
  dbracket  = "\\[\\[", 
  bracket   = "\\[",
  plus      = "\\+",
  minus     = "\\-",
  name      = "[a-z.]+"
)

x <- "abc[[+de.f[-[[g"
lex(x, patterns)
unname(lex(x, patterns))
# [1] "abc"  "[["   "+"    "de.f" "["    "-"   
# [7] "[["   "g"

answered Apr 11 '19 at 19:23

MrFlick


This works almost perfectly : `temp – Moody_Mudskipper Apr 11 '19 at 20:17
@Moody_Mudskipper If you look at the source for `lex()` it's also basically doing the `str_match_all()` method in your answer. There are probably better ways to build a proper fast lexer but that type of stuff is usually easier at the C level. It's probably going to be easier just to stick with your answer. – MrFlick Apr 11 '19 at 20:54

score 1 · Answer 3 · answered Apr 11 '19 at 18:33

One option to get your matches might be to use an alternation:

[a-z.]+|\[+|[+-]

[a-z.]+ Match 1+ times a-z or dot
| Or
\[+ match 1+ times a [
| Or
[+-] Match + or -

For example, to get the matches:

library(stringr)
x <- "abc[[+de.f[-[[g"
str_extract_all(x, "[a-z.]+|\\[+|[+-]")

answered Apr 11 '19 at 18:33

The fourth bird


Thanks, it's probably the way to go, but you're making strong assumptions about my data, the characters could be anything – Moody_Mudskipper Apr 11 '19 at 18:42
@Moody_Mudskipper It is based on the string in the example. You might extend the character class with the character that you want to allow. – The fourth bird Apr 11 '19 at 18:52
you made `[+` into an exception, and then encoded manually `+` and `-` in a way that works only for single characters, I need to start from the vector v – Moody_Mudskipper Apr 11 '19 at 19:06
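To make the objection in these comments concrete, here is a sketch that runs the hard-coded pattern from this answer on the second example from the question, using base R instead of stringr. The variable names `x2`, `hard_coded` and `got` are illustrative only.

```r
x2 <- "a*bc[[+de.f[-[[g[*+-h-+"
hard_coded <- "[a-z.]+|\\[+|[+-]"

got <- regmatches(x2, gregexpr(hard_coded, x2))[[1]]
got

## Characters outside the hard-coded classes (here "*") are silently skipped,
## so the pieces no longer reassemble into the input
identical(paste(got, collapse = ""), x2)   # FALSE
any(grepl("*", got, fixed = TRUE))         # FALSE: no piece contains "*"
```

The two-character items "[*" and "+-" from `v` are also never returned as tokens, since the pattern does not know about them.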

Wiktor Stribiżew · Accepted Answer · 2020-06-08T13:30:55.963

A pure regex-based solution will look like

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

## Escaping function
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]

pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res <- regmatches(x, gregexpr(pat, x, perl=TRUE))
## => [[1]]
##    [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"

The PCRE regex here is

(?s)\[\[|\+|-|\[|(?:(?!\[\[|\+|-|\[).)+

Details

(?s) - a DOTALL modifier that makes . match any char including newlines
\[\[ - [[ substring (escaped with regex.escape)
| - or
\+ - a +
|- - or a - (no need to escape - as it is not inside a character class)
|\[ - or [
| - or
(?:(?!\[\[|\+|-|\[).)+ - a tempered greedy token that matches any char (.), one or more times and as many as possible (the + at the end), as long as that char does not start a [[, +, - or [ sequence.
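As a sketch of the breakdown above, the tempered greedy token can be run on its own to see that it picks up only the text between the known substrings. The same construction is then applied to the second example from the question; the names `rest_only`, `x2`, `v2` and `res2` are illustrative.

```r
x <- "abc[[+de.f[-[[g"

## Only the tempered greedy token part of the pattern
rest_only <- "(?s)(?:(?!\\[\\[|\\+|-|\\[).)+"
regmatches(x, gregexpr(rest_only, x, perl = TRUE))[[1]]
## => "abc" "de.f" "g"

## Full pattern, built dynamically for the second example
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
x2 <- "a*bc[[+de.f[-[[g[*+-h-+"
v2 <- c("+", "-", "[", "[[", "[*", "+-")
pat <- paste(regex.escape(v2[order(-nchar(v2))]), collapse = "|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res2 <- regmatches(x2, gregexpr(pat, x2, perl = TRUE))[[1]]
res2
```

With this input `res2` should match the `res` vector given for the second example in the question.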

You may also consider a less "regex intensive" solution with a TRE regex:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

## Escaping function
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Interleaving function
riffle3 <- function(a, b) { 
  mlab <- min(length(a), length(b)) 
  seqmlab <- seq(length=mlab) 
  c(rbind(a[seqmlab], b[seqmlab]), a[-seqmlab], b[-seqmlab]) 
} 
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
res <- riffle3(regmatches(x, gregexpr(pat, x), invert=TRUE)[[1]], regmatches(x, gregexpr(pat, x))[[1]])
res <- res[res != ""]
## => [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"

To sum up: the search items are escaped so they can be used in a regex and sorted by length in descending order; the alternation pattern is built dynamically from them; all matching and non-matching substrings are extracted; finally the two sets are interleaved into a single character vector and empty items are discarded.
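The same steps can be wrapped in one function. This is only a sketch: `split_known` is a hypothetical name, and the interleaving is done with a plain rbind rather than `riffle3`, relying on the fact that the inverted match always has exactly one more element than the match.

```r
## Hypothetical wrapper around the TRE-based approach above
split_known <- function(x, v) {
  regex.escape <- function(string) {
    gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
  }
  v <- v[order(-nchar(v))]                          # longer items first
  pat <- paste(regex.escape(v), collapse = "|")     # dynamic alternation
  m <- gregexpr(pat, x)
  between <- regmatches(x, m, invert = TRUE)[[1]]   # text around the matches
  tokens  <- regmatches(x, m)[[1]]                  # the matched items of v
  ## Pad tokens to the length of between, then interleave column by column
  res <- c(rbind(between, c(tokens, "")))
  res[res != ""]                                    # drop empty pieces
}

split_known("abc[[+de.f[-[[g", c("+", "-", "[", "[["))
## => "abc" "[[" "+" "de.f" "[" "-" "[[" "g"

split_known("a*bc[[+de.f[-[[g[*+-h-+", c("+", "-", "[", "[[", "[*", "+-"))
```

When nothing from `v` occurs in `x`, the padded token vector is a single empty string and the input comes back unchanged as a one-element vector.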

Thanks, this is what I used. I like that this is 100% base R too. I didn't do benchmarks though but in hindsight speed won't really be an issue. — Moody_Mudskipper, Apr 12 '19 at 06:55
@Moody_Mudskipper Note that a tempered greedy token is resource consuming, and it may turn out slow. However, it is the only way with plain regex since your data is dynamic. If you knew the `v` contents beforehand, you could manually optimize the pattern, but it is user-defined, there is no way for optimization. — Wiktor Stribiżew, Apr 12 '19 at 06:59
