
I scraped text from Wikipedia and would now like to perform text analysis on it. Before that, I'd like to remove all the LaTeX from it.

I have tried several regular expressions, but have been unable to find one that does the trick.

Texts that I want to preserve. Remove the messy latex below.

        2


    {\displaystyle 2}
  ⁄

            3


    {\displaystyle {\sqrt {3}}}
  . I want to preserve some texts here: (Similar latex as above)

    2


    {\displaystyle 2}
  ⁄

            3


    {\displaystyle {\sqrt {3}}}

I would expect the result to contain only the valid text. In the case above, that is: (Texts that I want to preserve. Remove the messy latex below. I want to preserve some texts here: (Similar latex as above))

rockpang

1 Answer

With regular expressions, you would need a pattern that matches balanced curly braces { ... }. Most regex implementations cannot do this (engines with recursion support, such as PCRE, are the exception); see "Regular expression to match balanced parentheses".

Instead, you should write a script that reads your file line by line, looks for {\displaystyle, locates the corresponding closing curly brace, and drops everything in between.
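The approach just described could be sketched in Python roughly as follows. This is only a minimal illustration under a few assumptions: the function name strip_displaystyle is made up for this example, the text is scanned as one string rather than line by line (a block may span several lines), and escaped braces such as \{ inside the LaTeX are not treated specially. It also removes only the {\displaystyle ...} blocks themselves, not the stray rendered characters (2, ⁄, 3) that precede them in the scraped text.

```python
MARKER = r"{\displaystyle"  # literal text that opens each LaTeX block


def strip_displaystyle(text: str) -> str:
    """Remove every balanced {\\displaystyle ...} block from text.

    Hypothetical helper for illustration; assumes braces inside the
    block are balanced and not escaped.
    """
    kept = []  # pieces of text outside any LaTeX block
    pos = 0
    while True:
        start = text.find(MARKER, pos)
        if start == -1:
            kept.append(text[pos:])  # no more blocks: keep the rest
            break
        kept.append(text[pos:start])

        # Walk forward from the opening brace, tracking nesting depth,
        # until the brace that closes the block is found.
        depth = 0
        end = start
        while end < len(text):
            if text[end] == "{":
                depth += 1
            elif text[end] == "}":
                depth -= 1
                if depth == 0:
                    break
            end += 1

        if depth != 0:
            # Unbalanced block: leave the remainder untouched.
            kept.append(text[start:])
            break
        pos = end + 1  # continue after the closing brace

    return "".join(kept)
```

Reading the whole file into a string and passing it to this function would strip the blocks in one pass; the leftover digits and whitespace from the rendered formula would still need separate cleanup.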

user3187724