Parsing bullets containing newlines from plain text

Question

I am trying to parse a text document containing multiple bullets.

A bullet point may contain single newline characters. I would like to keep those inside the bullet and end it only when 2 or more consecutive newline characters are found.

For example:
-----------------------------------
* bullet
text on new line
more text

this should be a separate block
-----------------------------------

When passed through the function, this should capture:
-----------------------------------
-> start
bullet 
text on new line 
more text
<- end capture

this should be a separate block
-----------------------------------

This is what I have so far. I have written a JavaScript function that recursively parses ordered/unordered MediaWiki-ish lists to HTML. The only difference is that blocks are inserted on 2 line breaks, versus the MediaWiki way of 1 line break.

function parseLists(str)
{
    // How can I capture bulleted lines with less than or equal to "1" newline character?
    return str.replace(/(?:(?:(?:^|\n)[\*#].*)+)/g, function (match) {
        var listType = match.match(/(^|\n)#/) ? 'ol' : 'ul';
        match = match.replace(/(^|\n)[\*#][ ]{0,1}/g, "$1");
        match = parseLists(match);
        return '<'
                + listType + '><li>'
                + match.replace(/^\n/, '').split(/\n/).join('</li><li>')
                + '</li></' + listType
                + '>';
    });
}

http://jsfiddle.net/epinapala/L18y7zyx/7/

I think the problem is with the first regex, /(?:(?:(?:^|\n)[*#].*)+)/g, which matches the bullets. This regex breaks as soon as a newline character is found. How can I capture bulleted lines with less than or equal to "1" newline character?

I would like to parse bullets that have newlines in them, and break a bullet only if there are 2 or more newline characters followed by bullet content.

[Edit] - I was able to make some changes, and the current version of my function looks like this:

function parseLists2(str)
{
    return str.replace(/(?:(?:(?:^|\n)[\*#](?:.+\n)+.*))/g, function (match) {
        match = match.replace(/\n(?![#\*])/g, " ");
        //alert(match);
        var listType = match.match(/(^|\s)#/) ? 'ol' : 'ul';
        match = match.replace(/(^|\s)[\*#][ ]{0,1}/g, "$1");
        match = parseLists2(match);
        return '<'
                + listType + '><li>'
                + match.replace(/^\s/, '')
                .split(/\n/).join('</li><li>')
                + '</li></' + listType
                + '>';
    });
}

The only problem I am facing is with a pattern like the one below:

* some ul item
* some ul item 
# some ol item

The ul item is not being separated as a block unless it is separated by a double line break.

Thanks!

It's just the same markup repeated if you want to make the test text bigger. All I am trying to extract is each bullet point unless separated by two or more newline characters. The problem now is that even one newline character is being parsed as a new block of text altogether. — Eswar Rajesh Pinapala, Dec 15 '14 at 02:11
I suspected something is wrong with recursive regex, so I came up with this example: http://pastebin.com/RkGj3h4v — Ming-Tang, Dec 15 '14 at 02:16
Though it has nothing to do with what I am trying to solve, I updated the code to fix another problem with the recursive regex. — Eswar Rajesh Pinapala, Dec 15 '14 at 02:28
This is media wiki markup, a bit customized. I have given an example above. — Eswar Rajesh Pinapala, Dec 15 '14 at 02:34
I think the problem is with /(?:(?:(?:^|\n)[\*#].*)+)/g. I need to change this regex to capture bullets with 0 or 1 newline characters but break on 2 or more newline characters. — Eswar Rajesh Pinapala, Dec 15 '14 at 02:36
JavaScript sucks with multiline regexp. Either break out the text using split and some manual processing, or try your luck with `/[\s\S]/` http://stackoverflow.com/a/1068308/227176 — Sukima, Dec 15 '14 at 02:55
I tried doing (?:(?:(?:^|\n)[\*#][\s\S]*)+), but that seems to select all the lines. I would like to limit the capture until a double line break is encountered. Can you explain more about splitting the text? — Eswar Rajesh Pinapala, Dec 15 '14 at 03:08
I think it would be better to look at the *MediaWiki* source code or write a context-free grammar instead of a regex... — Willem Van Onsem, Dec 15 '14 at 19:40
You can loop line-by-line and just remember the state you are in before ([like in my markdown parser](https://github.com/bjb568/Markdown-HTML/blob/master/core.js#L126)). — bjb568, Dec 16 '14 at 02:31
I believe iterating line-by-line would be slower than regex. Moreover, I have created the rest of the parser (for other markup syntax) using regex in a similar way. Only this is driving me nuts! :( — Eswar Rajesh Pinapala, Dec 16 '14 at 04:48
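
For readers who want to try the split-based idea suggested in the comments above, here is a minimal sketch. It is an illustration only, not code from any commenter: the names `renderBlock` and `parseListsByBlock` are made up here, it assumes a blank line (2 or more newlines) ends a block, it joins continuation lines with a space, and it neither handles nested lists nor HTML-escapes the text.

```javascript
// Hypothetical helper: render one block (text with no blank lines in it).
// A line starting with "*" or "#" opens a new list item; any other line
// is treated as a continuation of the previous item.
function renderBlock(block) {
  var items = []; // [{ type: "ul" | "ol" | null, text: "..." }]
  block.split("\n").forEach(function (line) {
    if (line === "") return; // ignore a stray leading/trailing newline
    var m = /^([*#]) ?(.*)$/.exec(line);
    if (m) {
      items.push({ type: m[1] === "#" ? "ol" : "ul", text: m[2] });
    } else if (items.length > 0) {
      items[items.length - 1].text += " " + line; // continuation line
    } else {
      items.push({ type: null, text: line }); // plain text, no bullet yet
    }
  });

  // Group consecutive items of the same type into one <ul> or <ol>.
  var html = "";
  var open = null;
  items.forEach(function (item) {
    if (item.type !== open) {
      if (open) html += "</" + open + ">";
      if (item.type) html += "<" + item.type + ">";
      open = item.type;
    }
    html += item.type ? "<li>" + item.text + "</li>" : item.text;
  });
  if (open) html += "</" + open + ">";
  return html;
}

// Hypothetical entry point: blocks are separated by 2 or more newlines.
function parseListsByBlock(text) {
  return text
    .split(/\n{2,}/)
    .filter(function (block) { return block !== ""; })
    .map(renderBlock)
    .join("\n\n");
}

console.log(parseListsByBlock(
  "* bullet\ntext on new line\nmore text\n\nthis should be a separate block"
));
```

With this shape, a `*` item followed directly by a `#` item on the next line ends up in separate `<ul>` and `<ol>` lists without needing a blank line between them, which is the case described at the end of the question.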

Witiko · Accepted Answer · 2014-12-17T17:16:44.127

You can first create the lists and the <li>s for your bullets using these two regexes:

/\*\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g;
 /#\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g;

You can then join adjacent <ul>s and <ol>s using another regex:

/(<\/ul>\n?<ul>|<\/ol>\n?<ol>)/g;

Example

The following snippet demonstrates this:

txt1.onkeyup = txt1.onkeydown = txt1.onchange = replace;
replace();
  
function replace() {
  txt2.innerHTML = txt1.value.
    replace (/\*\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g, "<ul><li>\n$1</li></ul>").
    replace ( /#\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g, "<ol><li>\n$1</li></ol>").
    replace (/(<\/ul>\n?<ul>|<\/ol>\n?<ol>)/g, "");
}

#txt1, #txt2 {
  width: 40%;
  height: 150px;
  display: inline-block;
  overflow-y: scroll;
}

<textarea id="txt1">
* aaaa
* bbbb
# cccc
# dddd

This text is separate.
</textarea><div id="txt2"></div>
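
To experiment with the same three replacements outside the browser, they can be wrapped in a plain function. This is only a sketch: the name `listsToHtml` and the sample variable are hypothetical, while the regexes and replacement strings are copied unchanged from the snippet above.

```javascript
// Hypothetical wrapper (not part of the answer): applies the answer's three
// regexes in order and returns the resulting HTML string.
function listsToHtml(text) {
  return text
    // Each "*" bullet, including its continuation lines, becomes a one-item <ul>.
    .replace(/\*\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g, "<ul><li>\n$1</li></ul>")
    // Each "#" bullet becomes a one-item <ol> in the same way.
    .replace(/#\s*(([^\n]*(\n|$))*?)(?=\n|#|\*|<[uo]l>|$)/g, "<ol><li>\n$1</li></ol>")
    // Adjacent lists of the same type are merged by deleting the seam between them.
    .replace(/(<\/ul>\n?<ul>|<\/ol>\n?<ol>)/g, "");
}

// Same content as the textarea in the snippet above.
var sampleText = "* aaaa\n* bbbb\n# cccc\n# dddd\n\nThis text is separate.\n";
console.log(listsToHtml(sampleText));
```

One thing to keep in mind when trying this: the first two regexes are not anchored to the start of a line, so a `*` or `#` appearing in the middle of ordinary text is also treated as a bullet.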
