split a string but ignore separators surrounded by given characters

Question

I would like to split a string, but only at separators that are not enclosed by given pairs of characters.

current :

strsplit("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}","\\?")
#> [[1]]
#> [1] "1 "   " 2 "  " (3 " " 4) " " {5 " " (6 " " 7)}"

expected :

strsplit2 <- function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE,
                      escape = c("()","{}","[]","''",'""',"%%")){
  # ... 
}
strsplit2("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}","\\?")
#> [[1]]
#> [1] "1 "   " 2 "  " (3 ? 4) " " {5 ? (6 ? 7)}"

I solved this with some complex parsing but I worry about the performance and wonder if regex can be faster.

FYI :

My current solution (not really that relevant to the question) is :

parse_qm_args <- function(x){
  x <- str2lang(x)
  # if single symbol
  if(is.symbol(x)) return(x)
  i <- numeric(0)
  out <- character(0)
  while(identical(x[[c(i,1)]], quote(`?`)) &&
        (!length(i) || length(x[[i]]) == 3)){
    out <- c(x[[c(i,3)]],out)
    i <- c(2, i)
  }
  # if no `?` was found
  if(!length(out)) return(x)

  if(length(x[[i]]) == 2) {
    # if we have a unary `?` fetch its arg
    out <-  c(x[[c(i,2)]],out)
  } else {
    # if we have a binary `?` fetch its first arg
    out <-  c(x[[c(i)]], out)
  }
  out
}
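
For context, the parsing approach works because `?` is a valid R operator, so `str2lang()` turns the input into nested calls that can be walked. A small illustration of that structure, using a made-up input string (the variable names are only for this sketch):

```r
expr <- str2lang("1 ? 2 ? (3 ? 4)")

# the outermost call is a binary `?` with three elements: operator, left, right
is_qm_call <- identical(expr[[1]], quote(`?`))
n_elements <- length(expr)

# `?` is left-associative, so the left operand is itself the call `1 ? 2`,
# while the right operand is the parenthesised group, kept intact
left  <- expr[[2]]
right <- expr[[3]]

# a leading `?` parses as a unary call with two elements
unary <- str2lang("? x")
length(unary)
```

This is why `parse_qm_args()` repeatedly takes the third element and then descends into the second one.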

This seems relevant: https://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes?rq=1 but I'm not sure how to generalize it — Moody_Mudskipper, Sep 25 '19 at 16:36
Are your braces always balanced? If yes, then the easiest thing you can do is iterate over the string, keep track of opening braces and `?`, and split only when you reach a separator and the numbers of opening and closing braces are equal; otherwise replace `?` with `+` as desired in your output string — Code Maniac, Sep 25 '19 at 16:38
Parsing is always going to be faster than backreferences, and you will need a backreference for your use case. IMO simple parsing is enough to get what you're expecting — Code Maniac, Sep 25 '19 at 16:40
To expand on my first comment: something like [`this`](https://jsbin.com/wudobebuna/edit?js,console) can be done easily. I don't think regex can be faster than simple parsing when it requires a backreference. — Code Maniac, Sep 25 '19 at 16:54
This kind of for loop is slow in R unfortunately, but I guess it could be implemented in C++ using Rcpp... — Moody_Mudskipper, Sep 25 '19 at 16:57
@Moody_Mudskipper you can fine tune it as per language, i just intended to show the logic, hope it helps :) — Code Maniac, Sep 25 '19 at 17:04
It does, thank you. I'm working on translating it now, but R is notorious for being slow with for loops because of memory allocation issues, so I believe it will be slow. Also, the given code, though it solves my example, doesn't deal with sets whose opening and closing characters are identical, such as quotes, so I'll need to tweak it a bit. — Moody_Mudskipper, Sep 25 '19 at 17:10
apologies, my expected output had some `+` instead of `?`, now corrected — Moody_Mudskipper, Sep 25 '19 at 17:15
@Moody_Mudskipper this is the first time I'm hearing that for loops are slow, though I'm not familiar with R. Anyway, if you don't need `+` then the last `if else` statement is no longer required: [`this`](https://jsbin.com/begoxifuzo/edit?js,console) — Code Maniac, Sep 25 '19 at 17:22
Also, if your concern is memory allocation, you can use counters to mark start and end instead of appending values continuously, and then slice from start to end once you find a separator where you want the value to be split — Code Maniac, Sep 25 '19 at 17:24

score 2 · Accepted Answer · answered Sep 25 '19 at 18:25

The best approach is to use a recursive pattern. It matches each grouped element as a whole and skips it, so the split happens only on ungrouped delimiters:

pattern = "([({'](?:[^(){}']*|(?1))*[')}])(*SKIP)(*FAIL)|\\?"

x1 <- "1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}"
x2 <- "1 ? 2 ? '3 ? 4' ? {5 ? (6 ? 7)}"
x3 <- "1 ? 2 ? '3 {(? 4' ? {5 ? (6 ? 7)}"
x4 <- "1 ? 2 ? '(3 ? 4) ? {5 ? (6 ? 7)}'"

strsplit(c(x1,x2,x3, x4),pattern,perl=TRUE)

[[1]]
[1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"

[[2]]
[1] "1 "             " 2 "            " '3 ? 4' "      " {5 ? (6 ? 7)}"

[[3]]
[1] "1 "             " 2 "            " '3 {(? 4' "    " {5 ? (6 ? 7)}"

[[4]]
[1] "1 "                         " 2 "                        " '(3 ? 4) ? {5 ? (6 ? 7)}'"
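
One way to check which separators the pattern actually selects is to locate the matches with `gregexpr()` instead of splitting. This is only an illustrative sketch on `x1` (the name `sep_pos` is made up here); everything matched by the first alternative is discarded by `(*SKIP)(*FAIL)`, so only the top-level `?` should remain:

```r
pattern <- "([({'](?:[^(){}']*|(?1))*[')}])(*SKIP)(*FAIL)|\\?"
x1 <- "1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}"

# start positions of every successful match, i.e. the separators used for splitting
sep_pos <- as.integer(gregexpr(pattern, x1, perl = TRUE)[[1]])
sep_pos
#> [1]  3  7 17

# each of those positions holds a `?`
substring(x1, sep_pos, sep_pos)
#> [1] "?" "?" "?"
```

The `?` at positions 12, 22 and 27 sit inside a group and are never reported as matches.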

`(?:[^(){}']*|(?1))*` will make it very slow. Besides, it might appear that there can be unbalanced `(` and `)` inside `{` and `}` and vice versa. I would use a more precise regex with more alternations, probably "unrolled". — Wiktor Stribiżew, Sep 25 '19 at 19:45

score 1 · Answer 2 · answered Sep 25 '19 at 17:50

`(*SKIP)(*FAIL)` and `perl = TRUE` are your friends here:

some_string <- c("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}")

pattern <- c("(?:\\{[^{}]*\\}|\\([^()]*\\))(*SKIP)(*FAIL)|\\?")
some_parts <- strsplit(some_string, pattern, perl = TRUE)
some_parts

This yields

[[1]]
[1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"

See a demo on regex101.com. This won't work for nested constructs of the same bracket type, such as `(a ? (b ? c))`.
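
To make the limitation concrete, here is a small comparison on a made-up input with same-type nesting. The input string and the variable names are assumptions for illustration; the two patterns are the ones posted in this thread:

```r
nested <- "1 ? (2 ? (3 ? 4))"

flat_pattern      <- "(?:\\{[^{}]*\\}|\\([^()]*\\))(*SKIP)(*FAIL)|\\?"
recursive_pattern <- "([({'](?:[^(){}']*|(?1))*[')}])(*SKIP)(*FAIL)|\\?"

# the non-recursive pattern only protects the innermost "(3 ? 4)",
# so the `?` directly inside the outer parentheses is still used as a separator
res_flat <- strsplit(nested, flat_pattern, perl = TRUE)[[1]]
res_flat
#> [1] "1 "        " (2 "      " (3 ? 4))"

# the recursive pattern skips the outer group as a whole
res_recursive <- strsplit(nested, recursive_pattern, perl = TRUE)[[1]]
res_recursive
#> [1] "1 "             " (2 ? (3 ? 4))"
```

Nesting of different bracket types, as in `{5 ? (6 ? 7)}`, is unaffected because `[^{}]*` happily consumes the parentheses.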

Answer 3 · Moody_Mudskipper · answered Sep 25 '19 at 18:41

Here is an implementation of @CodeManiac's idea with some optimisation and handling of edge cases.

splitter <- function(x) {
  str <- strsplit(x,"")[[1]]
  final <- character(0)
  strTemp <- ""
  count <- 0
  # define escape sets
  parensStart <- c("{","(")
  parensClosing <- c("}",")")
  parensBoth <- c("'",'"', "%")
  quotes_on <- FALSE
  # seq_along() also copes with an empty input, where 1:nchar(x) would give c(1, 0)
  for(i in seq_along(str)){
    if(str[i] %in% parensBoth){
      # handle quotes
      strTemp <- c(strTemp,str[i])
      if(!quotes_on) {
        quotes_on <- TRUE
        count <- 1 # no need to count here, just make it non zero
      } else {
        quotes_on <- FALSE
        count <- 0
      }
      next
    }

    if(str[i] == "?" && count == 0){
      # if found `?` reinitialise strTemp and count and append final
      final <- c(final, paste(strTemp, collapse=""))
      strTemp <- ""
      count <- 0
      next
    }

    strTemp <- c(strTemp,str[i])
    if(str[i] %in% parensStart){
      # increment count entering set
      count <- count+1
    } else if(str[i] %in% parensClosing){
      # decrement if exiting set
      count <- count-1
    }
  }
  # append what's left
  final <- c(final, paste(strTemp, collapse=""))
  final
}

results :

x1 <- "1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}"
splitter(x1)
#> [1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"
x2 <- "1 ? 2 ? '3 ? 4' ? {5 ? (6 ? 7)}"
splitter(x2)
#> [1] "1 "             " 2 "            " '3 ? 4' "      " {5 ? (6 ? 7)}"

An edge case I didn't think about when writing the question: characters between quotes are not candidates for separators.

x3 <- "1 ? 2 ? '3 {(? 4' ? {5 ? (6 ? 7)}"
splitter(x3)
#> [1] "1 "             " 2 "            " '3 {(? 4' "    " {5 ? (6 ? 7)}"

benchmark

Parsing is roughly 6 to 9 times faster so far (median of 14.8 µs versus about 93 to 128 µs in the table below), though the loop solution above might be optimised further by using Rcpp. The parsing solution might also be optimised further.

Jan's and Onyambu's solutions are much more compact and elegant. Onyambu's handles nesting, quotes, and the edge case of separators trapped in quotes (though not part of the question), while Jan's doesn't. And they're approximately as fast.

regex_split_jan <- function(x){
  pattern <- c("(?:\\{[^{}]*\\}|\\([^()]*\\))(*SKIP)(*FAIL)|\\?")
  out <- strsplit(x, pattern, perl = TRUE)[[1]]
  out
}

regex_split_onyambu <- function(x){
  pattern <- c("([({'](?:[^(){}']*|(?1))*[')}])(*SKIP)(*FAIL)|\\?")
  out <- strsplit(x, pattern, perl = TRUE)[[1]]
  out
}

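# assumption: the original does not show which input `x` was benchmarked;
# x1 from above is used here only so that the snippet runs
x <- x1

microbenchmark::microbenchmark(
  regex_jan = as.list(parse(text=regex_split_jan(x))),
  regex_onyambu = as.list(parse(text=regex_split_onyambu(x))),
  loop  = as.list(parse(text=splitter(x))),
  parse = parse_qm_args(x)
)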

#> Unit: microseconds
#>           expr   min     lq    mean median     uq    max neval cld
#>      regex_jan  89.1  92.15 112.114  92.95  94.45 1893.5   100   b
#>  regex_onyambu  91.0  93.50 116.850  94.95  96.45 2056.1   100   b
#>           loop 122.0 125.95 130.289 128.30 131.20  169.8   100   b
#>          parse  10.7  13.55  14.642  14.80  15.65   25.3   100  a
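
As a possible follow-up to the comment about marking start and end positions instead of appending character by character, the loop can also be expressed with vectorised operations. The sketch below is hypothetical and not part of any answer above: `split_top_level` and its arguments are made-up names, it has not been benchmarked here, and like `splitter` it assumes that any quote character closes any other.

```r
# Hypothetical helper: split `x` on `sep` wherever the separator sits
# outside brackets and outside quotes.
split_top_level <- function(x, sep = "?",
                            open = c("(", "{"), close = c(")", "}"),
                            quotes = c("'", '"', "%")) {
  chars <- strsplit(x, "")[[1]]
  # TRUE from an opening quote character up to (not including) its closer
  in_quote <- cumsum(chars %in% quotes) %% 2 == 1
  # bracket nesting depth at each character, ignoring brackets inside quotes
  depth <- cumsum(chars %in% open & !in_quote) -
    cumsum(chars %in% close & !in_quote)
  # separators that are at depth zero and not quoted
  cut <- which(chars == sep & depth == 0 & !in_quote)
  # slice once per piece instead of growing a buffer character by character
  substring(x, c(1, cut + 1), c(cut - 1, length(chars)))
}

split_top_level("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}")
#> [1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"
split_top_level("1 ? 2 ? '3 {(? 4' ? {5 ? (6 ? 7)}")
#> [1] "1 "             " 2 "            " '3 {(? 4' "    " {5 ? (6 ? 7)}"
```

Whether this is actually faster than the loop or the regex solutions would need to be measured with the benchmark above.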
