What are the downsides of using a simple regex-based markdown parser?

Question

I require a relatively simple markdown parser for my application. Just simple stuff like bolding, italics, etc. I was looking around for libraries and many seem to be quite large. For example, marked is quite popular with 20,000 stars. And it's close to 2,000 lines of code. I'm not even sure how large this one is, but it seems quite complex.

Generally I prefer to keep things simple and limit my dependencies whenever possible, and I'm not exactly sure what all those lines are doing. Soon after, I was pleased to find this library, which isn't even 100 lines and just uses simple regexes to transform Markdown text into the corresponding HTML.

My question is, basically, what are those other libraries doing? Am I missing something by opting to use a simpler, regex-focused approach? Is the latter library not safe in some way? Should I be considering some other factor that I am ignorant of?

Clearly there seems to be something of importance that I am missing, because the former libraries seem quite popular, and the latter one has not even a single star. I'm just not sure what that is. I'm hoping that the case is that the latter is fine for simple cases, and the former ones are more "complete" if that's what you need, but I don't want to jump to that conclusion.

The small library does not parse true Markdown, only a small custom version of it created by the same person who created the library. JS regular expressions can't parse recursive patterns by themselves - true Markdown allows for nested blockquotes and lists, for instance, which a simple regex alone wouldn't be sufficient for. — CertainPerformance, Sep 04 '19 at 05:04
@CertainPerformance: Ah, thank you for the reply! Hmm, I think I may be fine with non-recursive patterns for my application. Is that the sole reason or is there something else? If that is the main factor then I may just go with the latter solution. — Ryan Peschel, Sep 04 '19 at 05:05
Not sure. Maybe a true tokenizer (which requires a not-insignificant bit of code) is considered more reliable, but it may not be necessary if the input will not have any problematic nesting. — CertainPerformance, Sep 04 '19 at 05:09
@CertainPerformance: Cool, thank you for your replies! Looks like this is going to get closed soon so glad to have gotten the knowledge before then! :p — Ryan Peschel, Sep 04 '19 at 05:10
I don't agree with the close vote reason, this is not a library request - someone with the right expertise could probably provide a good answer. — CertainPerformance, Sep 04 '19 at 05:12
Well, if you look at the source code of the smallest one, it doesn't cover all the cases. Look at the first rule, `(\*)(.*)\1`, which it replaces with `$2`: what if a string has `\*hello\*`? In actual Markdown this should not be replaced with a tag, but this library will change it to ``. See the [`Demo`](https://regex101.com/r/TokdcV/1/); you can test the [`Markdown output here`](https://stackedit.io/app#) — Code Maniac, Sep 04 '19 at 05:48
@CodeManiac: Ah, that is a bit more of a problem, yes. I believe I'd like to maintain the ability to escape this pseudo-markdown. Is there a way to maintain that property while still using simple regexes? — Ryan Peschel, Sep 04 '19 at 05:55
@RyanPeschel if you know the format your strings can have, you can write the regexes yourself to cover all the cases that can occur in your use case, or simply extend these regexes as needed. But if you have lots of cases to cover, go for a well-tested library — Code Maniac, Sep 04 '19 at 05:57
@CodeManiac: Is there a better way than something like this? `text.replace(/([^\\])(\*)(.*)([^\\])(\*)/g, '$1$3$4');` — Ryan Peschel, Sep 04 '19 at 06:03
@RyanPeschel if I had to do it, I would list all the cases my string can take. Once I was sure of all the possible cases, I would tackle them one by one without thinking much about optimization, and once all the cases were covered I would try to remove any redundant code. — Code Maniac, Sep 04 '19 at 06:04
@RyanPeschel in Markdown, `*hello*` should be replaced with italics, but this regex will replace it with ``, so you need to cover all those cases: `***` will be replaced by italics and bold, `**` will be replaced by `bold`, and `*` will be replaced by `italics` — Code Maniac, Sep 04 '19 at 06:06
@CodeManiac: Hmm, yeah, I'm beginning to understand why these markdown parsers are long. Escaping + nesting makes things awfully complex. I'm currently looking online and it looks like most regex markdown parsers don't handle escaping at all. Trying to think about whether or not I even want to include the functionality if it'll be too confusing. — Ryan Peschel, Sep 04 '19 at 06:12

Waylan · Accepted Answer · 2019-09-04T23:58:48.997

There are a number of factors which contribute to the complexity of Markdown parsers. That said, you can use a "simple regex-based" method to build a Markdown parser. In fact, this is exactly what the reference implementation uses (in Perl). It runs a series of regular expressions which replace the Markdown syntax with HTML syntax on the existing document. Even then, the source code is comprised of 1451 lines of code, including comments, license, etc. Of course, it includes support for the entire list of features described in the original syntax rules. Those features include things like support for nesting, escaping and the like, which significantly complicate the use of regex.
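As a rough illustration of the regex-replacement approach, and of why escaping complicates it, here is a minimal JavaScript sketch. It is not the reference implementation's code: `renderInline`, the placeholder characters, and the two rules are assumptions made up for this example. It hides backslash-escaped characters behind placeholders before running the substitutions and restores them afterwards.

```javascript
// Hypothetical mini-converter: bold and italics only, with backslash escapes.
// Private-use characters delimit placeholders so they are unlikely to collide with real text.
const OPEN = "\uE000";
const CLOSE = "\uE001";

function renderInline(text) {
  const escaped = [];

  // 1. Hide escaped characters (e.g. \*) behind placeholders so later rules skip them.
  let out = text.replace(/\\([\\*_])/g, (_, ch) => {
    escaped.push(ch);
    return OPEN + (escaped.length - 1) + CLOSE;
  });

  // 2. Apply the substitutions. Order matters: ** must be handled before *.
  out = out.replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>");
  out = out.replace(/\*(.+?)\*/g, "<em>$1</em>");

  // 3. Restore the escaped characters as literals.
  const placeholder = new RegExp(OPEN + "(\\d+)" + CLOSE, "g");
  return out.replace(placeholder, (_, i) => escaped[Number(i)]);
}

console.log(renderInline("**bold** and *italic*"));
console.log(renderInline("\\*not italic\\*"));
```

Even this tiny version has gaps. `***x***` comes out as mis-nested tags (`<strong><em>x</strong></em>`), and raw HTML in the input is passed through unescaped, so it should not be used on untrusted input as is.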

Some people find such an implementation limiting. It all depends on what you want out of a Markdown parser.

For example, extending the syntax is near impossible with the reference implementation. As an example, Python-Markdown (of which I am a developer) has taken the reference implementation, given each regex a name and provided a way for third-party extensions to replace or insert new regular expressions into the mix. The boilerplate code just to allow this adds considerably more lines of code. By the way, Markdown is old and libs such as Python-Markdown have changed and grown over the years. The first version very closely mimicked the reference implementation, but today you would be hard pressed to see any similarities between them.
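The idea of naming each regex so that extensions can insert or replace rules can be sketched in a few lines of JavaScript. This is only an illustration of the concept: `RuleRegistry` and its `register`, `replace`, and `run` methods are hypothetical names, not Python-Markdown's actual API.

```javascript
// Hypothetical registry of named, ordered regex rules (not Python-Markdown's real API).
class RuleRegistry {
  constructor() {
    this.rules = []; // kept in the order the rules should run
  }

  // Add a rule at the end, or before an existing named rule.
  register(name, pattern, replacement, { before } = {}) {
    const rule = { name, pattern, replacement };
    const index = before ? this.rules.findIndex((r) => r.name === before) : -1;
    if (index === -1) this.rules.push(rule);
    else this.rules.splice(index, 0, rule);
  }

  // Swap out the pattern and replacement of an existing rule.
  replace(name, pattern, replacement) {
    const rule = this.rules.find((r) => r.name === name);
    if (!rule) throw new Error(`Unknown rule: ${name}`);
    rule.pattern = pattern;
    rule.replacement = replacement;
  }

  // Run every rule in order over the text.
  run(text) {
    return this.rules.reduce((out, r) => out.replace(r.pattern, r.replacement), text);
  }
}

const registry = new RuleRegistry();
registry.register("strong", /\*\*(.+?)\*\*/g, "<strong>$1</strong>");
registry.register("em", /\*(.+?)\*/g, "<em>$1</em>");

// A third-party "extension" adds ~~strikethrough~~ without touching the core rules.
registry.register("del", /~~(.+?)~~/g, "<del>$1</del>", { before: "em" });

console.log(registry.run("**a** *b* ~~c~~"));
```

Note that the registry itself is already longer than the three rules it manages, which is the kind of boilerplate growth described above.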

Others aren't interested in extending the syntax so much as offering a way to control the output. For example, the marked JS library outputs an abstract syntax tree (AST), which can then be passed to a renderer. Renderers accept the AST (basically a list of tokens) and output some other format. That other format could be HTML, or it could be something else. Pandoc takes advantage of this to convert to and from many document formats. Naturally, this adds additional lines of code.
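The separation between parsing and rendering might look roughly like the following sketch. It is a toy under stated assumptions: `tokenize`, `render`, `htmlRenderer`, and `plainRenderer` are hypothetical names, the token shape is invented for the example, and only `**bold**` and `*italic*` are recognized. It does not reflect the real API of marked or Pandoc.

```javascript
// Hypothetical tokenizer for *italic* and **bold** only (not marked's real API).
function tokenize(text) {
  const tokens = [];
  const pattern = /\*\*(.+?)\*\*|\*(.+?)\*/g;
  let last = 0;
  for (const m of text.matchAll(pattern)) {
    // Plain text between the previous match and this one.
    if (m.index > last) tokens.push({ type: "text", value: text.slice(last, m.index) });
    if (m[1] !== undefined) tokens.push({ type: "strong", value: m[1] });
    else tokens.push({ type: "em", value: m[2] });
    last = m.index + m[0].length;
  }
  if (last < text.length) tokens.push({ type: "text", value: text.slice(last) });
  return tokens;
}

// Two renderers consuming the same token list.
const htmlRenderer = {
  text: (v) => v,
  strong: (v) => `<strong>${v}</strong>`,
  em: (v) => `<em>${v}</em>`,
};
const plainRenderer = { text: (v) => v, strong: (v) => v, em: (v) => v };

function render(tokens, renderer) {
  return tokens.map((t) => renderer[t.type](t.value)).join("");
}

const tokens = tokenize("a **b** and *c*");
console.log(render(tokens, htmlRenderer));
console.log(render(tokens, plainRenderer));
```

Because the renderers only see tokens, supporting another output format means writing another renderer object rather than another set of regexes.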

An additional factor, regardless of implementation, is that many would argue that if an implementation doesn't support all of the features in the rules, then it is not Markdown. In fact, over the years, many implementations have added non-standard features (see GitHub Flavored Markdown as an example). People begin to rely on these non-standard features and will file bug reports complaining that an implementation does not support them. As the developer of Python-Markdown, I regularly see such reports when the lib does in fact offer support. It just isn't enabled by default. When this is pointed out to them, their reaction is often less than understanding. So no implementation made for general consumption will last long without support for all standard features.

A further complication is that there is not perfect agreement among implementations regarding the standard features. See the Babelmark 2 FAQ for details. In that FAQ you will find a lot of documented differences which are rather nuanced. People really find these minor differences important. For that reason, a group of people created Commonmark, a strict specification for Markdown. However, as Commonmark has never received the blessing of the creator of Markdown, some question whether it can be considered Markdown at all. Additionally, in some places the spec, by its own admission, is in direct violation of the original rules. Regardless, for an implementation to be a Commonmark implementation it must provide a complete solution with all of the documented features of the spec. The reference implementations (in JS and C) are both quite large. In fact, I doubt you could implement Commonmark with an implementation which used simple regex-based replacement like markdown.pl does.

The point is that with all but the most simple implementations, you are getting more than simply a collection of regex substitutions. The exact features differ from implementation to implementation and would require a careful reading of the documentation for each. Regardless, even a "simple" collection of regex substitutions becomes quite complex and lengthy once it implements all of the documented features of Markdown. Anything less would not be considered Markdown.

Another consideration is performance. While the regex based parsers are 'good enough' for most general use (running from the command line as the reference implementation was designed for), the more performant implementations (such as marked or the Commonmark reference implementation) produce an AST and use a renderer. A regex based implementation will never come close to matching that in performance, which is important if your web server is converting Markdown to HTML on each request.
