How to remove all the LaTeX from a Wikipedia text?

Question

I scraped texts from Wikipedia and would now like to perform text analysis on them. To do that, I'd like to remove all the LaTeX from them.

I have tried some regular expressions, but I have been unable to find one that does the trick.

Texts that I want to preserve. Remove the messy latex below.

        2


    {\displaystyle 2}
  ⁄

            3


    {\displaystyle {\sqrt {3}}}
  . I want to preserve some texts here: (Similar latex as above)

    2


    {\displaystyle 2}
  ⁄

            3


    {\displaystyle {\sqrt {3}}}

I would expect the result to contain only the valid text. In the case above: (Texts that I want to preserve. Remove the messy latex below. I want to preserve some texts here: (Similar latex as above))

score 1 · Answer 1 · answered Nov 11 '19 at 11:25

With regular expressions, you're going to need a pattern that matches balanced curly braces { ... }. This is not possible in almost all regex implementations; see "Regular expression to match balanced parentheses".

Instead, you should write a script that reads your file line by line, looks for {\displaystyle, locates the corresponding closing curly brace, and removes everything from the marker through that brace.
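A minimal sketch of that brace-matching idea in Python, under a few assumptions: the function name `strip_displaystyle` is made up for illustration, and it scans the whole text as one string rather than strictly line by line, since a group could in principle span several lines. It counts opening and closing braces starting at each {\displaystyle marker and cuts out everything up to the matching close.

```python
MARKER = "{\\displaystyle"  # the literal text {\displaystyle


def strip_displaystyle(text: str) -> str:
    """Remove every balanced displaystyle brace group from text.

    Hypothetical helper, not part of any library.
    """
    out = []
    pos = 0
    while True:
        start = text.find(MARKER, pos)
        if start == -1:
            # No more markers: keep the remainder as-is.
            out.append(text[pos:])
            break
        out.append(text[pos:start])

        # Walk forward from the opening brace, tracking nesting depth.
        depth = 0
        i = start
        while i < len(text):
            ch = text[i]
            if ch == "\\":
                # Skip the character after a backslash, so escaped
                # braces such as \{ or \} are not counted.
                i += 2
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1

        if depth != 0:
            # Unbalanced group: leave the rest of the text untouched.
            out.append(text[start:])
            break
        pos = i + 1  # continue after the matching closing brace
    return "".join(out)


if __name__ == "__main__":
    sample = "Keep this. {\\displaystyle {\\sqrt {3}}} And this."
    print(strip_displaystyle(sample))
```

Note that this only removes the {\displaystyle ...} groups themselves. The plain-text rendering that Wikipedia places before each group (the stray 2, ⁄ and 3 lines in the question) is ordinary text and would need a separate cleanup step, for example dropping those elements while scraping.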
