Is there an R function to capture a lot of patterns in a text?

Question

I have the following text in my dataset:

[1] "q negociação c/v tipo mercado prazo especificação do título obs (*) quantidade preço / ajuste valor operação / ajuste d/c 1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d 1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d 1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d 1-bovespa c fracionario mrv on ed nm 40 18,14 725,60 d resumo dos negócios"

I would like to extract the pieces of text that fall between two patterns, specifically the text between "1-bovespa" and "d". Currently I use `str_extract` from the stringr package, but it returns only the first match. I would like the command to scan the whole text and, each time it finds the pattern again, add the match to a data frame.

I'm trying something like this:

str_extract_all(out, "\\(1-bovespa).+?\\d")

Yes, `str_extract` returns only the first match. Switch to `str_extract_all` to get all matches. They share a help page, see `?str_extract` for details. It will return a `list`, which you can convert to a vector/dataframe as you like. — Gregor Thomas, Apr 05 '21 at 14:39
I see you've edited your code to use `str_extract_all`. With that change, do you still have a problem? If so, what is it? — Gregor Thomas, Apr 05 '21 at 16:14
If you notice, there are four pieces of information that follow this same pattern, something like: "1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d". So I wanted to get 4 vectors with this information. However, the command returns several other vectors that are unrelated to this pattern. — Matheus Ribeiro, Apr 05 '21 at 16:23

score 1 · Answer 1 · answered Apr 05 '21 at 16:45

Your pattern has parentheses in it, escaped so that they are taken literally, but your text does not contain parentheses. Also, `\d` is a regex escape that matches digits, whereas you want a literal `d`. I removed the parentheses and the `\\`, and it seems to work:

library(stringr)

out <- "q negociação c/v tipo mercado prazo especificação do título obs (*) quantidade preço / ajuste valor operação / ajuste d/c 1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d 1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d 1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d 1-bovespa c fracionario mrv on ed nm 40 18,14 725,60 d resumo dos negócios"
str_extract_all(out, "1-bovespa.+?d")
# [[1]]
# [1] "1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d" 
# [2] "1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d"
# [3] "1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d"
# [4] "1-bovespa c fracionario mrv on ed"
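
Note that the fourth match in the output above stops at "ed": the lazy `.+?` ends at the first `d` it meets, and "ed" contains one. A minimal sketch of one possible way around this, assuming the closing marker is always a standalone `d` preceded by a space and followed by whitespace or the end of the string. The shortened sample text and the names `matches` and `trades` are illustrative, not part of the original answer:

```r
library(stringr)

# Shortened copy of the sample text from the question (header trimmed).
out <- paste(
  "valor operação / ajuste d/c",
  "1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d",
  "1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d",
  "1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d",
  "1-bovespa c fracionario mrv on ed nm 40 18,14 725,60 d",
  "resumo dos negócios"
)

# " d(?=\\s|$)" requires a space, a "d", and then whitespace or end of string,
# so the "d" inside "ed" no longer ends the match early.
matches <- str_extract_all(out, "1-bovespa.+? d(?=\\s|$)")[[1]]

# One row per match, kept as a character column.
trades <- data.frame(text = matches, stringsAsFactors = FALSE)
```

`str_extract_all` returns a list with one element per input string, so `[[1]]` pulls out the character vector of matches for this single string before it is wrapped in a data frame.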

Thanks Gregor, this works very well. I was looking for something like that, but I didn't know that the "d" was a special feature for capturing text. — Matheus Ribeiro, Apr 05 '21 at 16:48
Well, `d` is just a `d`, but `\d` in regex (or `\\d` in an R string) is a special escape sequence. `\c, \s, \d, \w, \x, \b`, ... all have special meanings. — Gregor Thomas, Apr 05 '21 at 17:04
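
To make the distinction in the comment above concrete, here is a small sketch comparing a literal `d` with the digit escape `\\d`. The sample string `x` and the variable names are made up for illustration:

```r
library(stringr)

x <- "lot 42 d"

# A plain "d" matches only the letter d.
letters_found <- str_extract_all(x, "d")[[1]]

# "\\d" is the R string for the regex \d, which matches any single digit.
digits_found <- str_extract_all(x, "\\d")[[1]]

letters_found  # "d"
digits_found   # "4" "2"
```

The doubled backslash is only R's string escaping: the regex engine receives `\d`.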

nycrefugee · Answer 2 · 2021-04-05T17:07:26.310

Here's a different approach that uses the repeated patterns as delimiters. It's a bit hacky, but it seems to work:

library(tidyverse)
text <- "q negociação c/v tipo mercado prazo especificação do título obs (*) quantidade preço / ajuste valor operação / ajuste d/c 1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d 1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d 1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d 1-bovespa c fracionario mrv on ed nm 40 18,14 725,60 d resumo dos negócios"


delim1 <- "1-bovespa "
delim2 <- " d"

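result <- strsplit(text, delim1) %>%            # split the text at every "1-bovespa "
  unlist() %>%
  paste0(delim1, .) %>%                         # put the delimiter back in front of each piece
  strsplit(., delim2) %>%                       # cut each piece at every " d"
  unlist() %>% 
  enframe(value = "text", name = NULL) %>%      # one row per piece
  slice(2:nrow(.)) %>%                          # drop the first row, which comes from the header
  mutate(text = paste0(text, delim2)) %>%       # restore the trailing " d"
  filter(grepl(delim1, text))                   # keep only rows that contain "1-bovespa "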

With the result:

result
# A tibble: 4 x 1
  text                                                           
  <chr>                                                          
1 1-bovespa c fracionario magaz luiza on eb nm # 1 25,76 25,76 d 
2 1-bovespa c fracionario magaz luiza on eb nm # 9 25,76 231,84 d
3 1-bovespa c fracionario magaz luiza on eb nm 40 25,76 1030,40 d
4 1-bovespa c fracionario mrv on ed nm 40 18,14 725,60 d

Hello, this code returns an error: "Error in strsplit(text, delim) : object 'delim' not found" — Matheus Ribeiro, Apr 05 '21 at 16:43
Updated above: it should have been `paste0(delim1, .)`, not `paste0(delim, .)`. I also changed another `delim` reference to `delim1`. — nycrefugee, Apr 05 '21 at 17:08
