REGEX pattern match in R for Course number

Question

I need to identify course numbers that match the pattern xx.3xxxxxx. These are some examples of the course numbers.

I tried many patterns. One example is the pattern below, which did not return any match.

"[^0-9]{2}\Q.\E3[^0-9]+$"

I tried using grep and grepl. I actually need the code to return indexes.

This code shows my attempt to tag the rows that have matches.

Teacher$virtual[
  which(grepl("[^0-9]{2}\\Q.\\E3[^0-9]+$", Teacher$CourseNumber))
] <- "1"

I need to remove every row from my data frame whose course number matches that pattern, XX.3XXXXXX.

But my code did not find any match. Can you please help me?

For that you can try `grepl("^[0-9]{2}\\.3[0-9]+$", Teacher$CourseNumber)` — akrun, Jun 05 '19 at 16:00
Try this with `stringr`: `str_remove_all(Teacher$CourseNumber,"\\.(?=3)")`. — NelsonGon, Jun 05 '19 at 16:06
`grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)` should be enough. If you want to use in-pattern quoting with ``\Q`` and ``\E`` use PCRE regex, add `perl=TRUE`. — Wiktor Stribiżew, Jun 05 '19 at 16:27
I forgot to mention that the data type of the Course number is chr. Will that affect how the pattern should be? — Lilian Tan, Jun 05 '19 at 16:53

score 1 · Answer 1 · answered Jun 05 '19 at 16:05

Here, this simple expression would likely cover that:

^[0-9]{2}\.[3].+$

which requires a 3 (written as the character class [3]) right after the literal dot. It would probably work without the start and end anchors:

[0-9]{2}\.[3].+

We can add or remove boundaries if necessary.
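The expression above is written as a bare regex. Inside an R string literal the backslash has to be doubled, because the R parser rejects a single `\.` as an unrecognized escape. A minimal sketch of how the pattern might be passed to `grepl`, assuming a small vector of made-up course numbers (the `course` values are hypothetical, not from the question):

```r
# Hypothetical sample values; the real data would be Teacher$CourseNumber.
course <- c("12.3456789", "12.4456789", "45.3000000", "AB.3456789")

# Same pattern as above, with the backslash doubled for the R string literal.
emma_hits <- grepl("^[0-9]{2}\\.[3].+$", course)

print(emma_hits)
```

Only the values that start with two digits, a dot, and a 3 are flagged.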

Emma

Hi Emma, Thank you for your help. I tried your code and got this error.
Teacher2$Virtual3 – Lilian Tan Jun 05 '19 at 17:22
Your code is very similar to @akrun's. He has "^[0-9]{2}\\.3[0-9]+$". I tried his and it worked. – Lilian Tan Jun 05 '19 at 17:26

Wiktor Stribiżew · Accepted Answer · 2019-06-05T18:18:25.223

You should use

grepl("^[0-9]{2}\\.3", Teacher$CourseNumber)

Details:

^ - start of a string
[0-9]{2} - two digits
\\. - a literal dot (a regex escape needs a literal backslash, but inside a string literal, "...", a single backslash starts a string escape sequence, so the backslash must be doubled to produce the one backslash the regex engine needs)
3 - the digit 3.
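Putting the pattern to work on the tasks from the question (tagging rows, returning indexes, dropping rows) could look like the sketch below. The `Teacher` data frame here is a hypothetical stand-in with made-up course numbers, and `is_virtual`, `match_idx`, and `Teacher_clean` are illustrative names. `grepl` and `grep` work on character (chr) vectors, so the column type needs no special handling.

```r
# Hypothetical stand-in for the asker's data frame.
Teacher <- data.frame(
  CourseNumber = c("12.3456789", "12.4456789", "45.3000000", "7.3456789"),
  stringsAsFactors = FALSE
)

pattern <- "^[0-9]{2}\\.3"

# Logical vector: TRUE where the course number matches.
is_virtual <- grepl(pattern, Teacher$CourseNumber)

# Row indexes of the matches; grep() returns indexes directly.
match_idx <- grep(pattern, Teacher$CourseNumber)

# Tag the matching rows, as in the question.
Teacher$virtual <- ifelse(is_virtual, "1", NA)

# Drop the matching rows using the logical vector.
Teacher_clean <- Teacher[!is_virtual, , drop = FALSE]

print(match_idx)
print(Teacher_clean)
```

Filtering with `!is_virtual` rather than `Teacher[-match_idx, ]` is a deliberate choice: when nothing matches, `match_idx` is empty and negative indexing with an empty vector would return zero rows instead of all of them.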

NOTE: If you want to use in-pattern quoting with \Q and \E (in between which all chars are treated literally) you need to use PCRE regex, add perl=TRUE and use

grepl("^[0-9]{2}\\Q.\\E3", Teacher$CourseNumber, perl=TRUE)

Now, the dot is treated as a literal dot, not a . metacharacter that matches any char but a line break char (in a PCRE regex, . does not match line break chars by default).
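To see why the dot has to be made literal, the following sketch compares three spellings on two made-up values (the `course` vector is hypothetical). The bracket form `[.]` is an additional alternative not mentioned above; it also yields a literal dot with the default engine.

```r
# Hypothetical values: the second has an "X" where the dot should be.
course <- c("12.3456789", "12X3456789")

# \Q...\E quoting requires the PCRE engine, hence perl = TRUE.
quoted <- grepl("^[0-9]{2}\\Q.\\E3", course, perl = TRUE)

# An unescaped dot matches any character, so "12X3..." matches too.
unescaped <- grepl("^[0-9]{2}.3", course)

# A bracket expression also gives a literal dot with the default engine.
bracketed <- grepl("^[0-9]{2}[.]3", course)

print(rbind(quoted, unescaped, bracketed))
```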

edited Jun 05 '19 at 18:18

answered Jun 05 '19 at 18:14


the `perl=TRUE` option should be the default, IIRC it is both more efficient and more featureful – Rorschach Jun 05 '19 at 18:16
@jenesaisquoi I did not test the latest versions, but some time ago, our tests at SO showed that PCRE regex is faster in Linux and MacOS, but the default TRE is faster in Windows. Also, speaking about features, TRE, being a text-directed engine, picks the longest alternative from a group, and might be preferred in some cases. See [TRE vs. PCRE comparison](https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre/47251004#47251004). Actually, TRE supports fuzzy matching, PCRE does not. – Wiktor Stribiżew Jun 05 '19 at 18:22
there is some discussion of the R implementations buried in the bowels of the R documentation where they talk about performance. I'm sure you're right that there are some cases where it's preferred – Rorschach Jun 05 '19 at 18:24
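The difference in how the two engines handle alternation, mentioned in the comments above, can be illustrated with a small sketch. The input string and pattern are made up for the illustration; `regexpr` uses TRE by default and PCRE when `perl = TRUE` is set.

```r
# TRE (default): the longest alternative at the leftmost position wins.
tre_match <- regmatches("abcd", regexpr("ab|abcd", "abcd"))

# PCRE (perl = TRUE): the first alternative that matches wins.
pcre_match <- regmatches("abcd", regexpr("ab|abcd", "abcd", perl = TRUE))

print(tre_match)
print(pcre_match)
```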
