
I need a regular expression to extract the paragraph inside each div with class carousel-caption from an HTML string that comes from a JSON API in a React Native app.

var m,
    array = [],
    // Template literal, so the HTML can span several lines.
    str = `
<p>some other text .....  </p>
<div class="carousel-caption d-none d-md-block">\n\n                <p>some text .....  </p></div>
<div class="carousel-caption d-none d-md-block">\n\n                \n            </div>
<div class="carousel-caption d-none d-md-block">\n\n                <p>some text .....  </p></div>
<div class="carousel-caption d-none d-md-block">\n\n                <p>some text .....  </p></div>
<p>some other text .....  </p>`,
    rex = /<div [^<>]+carousel-caption[^<>]+>\s*<p>(.+?)<\/p>/g;
// Log the text of every caption paragraph the pattern finds.
do {
    m = rex.exec(str);
    if (m) {
        console.log(m[1]);
    }
} while (m);

I have multiple divs with class carousel-caption, each containing a single paragraph, and some paragraphs that are not inside a carousel-caption div. With rex I can get the paragraphs inside the carousel-caption divs. However, I want the array to have an empty entry when a div contains no paragraph, while maintaining the order, because I need each caption under its image and some images have no caption.

Tarif Aljnidi
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. – Toto May 26 '20 at 13:56

2 Answers


This pattern tolerates the classes in any order, single or double quotes around the class attribute, and additional attributes on the div. The p has to sit on a single line (no line breaks inside it) and must not carry any attributes. There has to be a line break between consecutive divs, because the dot does not match newlines and the greedy .* would otherwise run across several divs on the same line.

There are two capture groups: the first is the quote character (single or double, referenced again inside the regexp itself), the second is the text of the p.

<div.*class=("|')(?:\s*(?:carousel-caption|d-none|d-md-block)\s*){3}\1.*>\s*<p>(.*)<\/p>\s*<\/div>

let str = 
  '<p>some other text .....  </p>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>1 some text .....  </p></div>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>2 some text .....  </p></div>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>3 some text .....  </p></div>\n' + 
  '<p>some other text .....  </p>';
const rex = /<div.*class=("|')(?:\s*(?:carousel-caption|d-none|d-md-block)\s*){3}\1.*>\s*<p>(.*)<\/p>\s*<\/div>/g;
let m;

while ((m = rex.exec(str)) !== null) {
  console.log("Found", m[2]);
}

Note that this will also falsely match:

<div class="carousel-caption carousel-caption carousel-caption"> <p>some text .....  </p></div>
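The same pattern also accepts a div whose three class names never include carousel-caption, for example class="d-none d-none d-none". One possible way around this, sketched below, is to capture the whole class attribute and test for the carousel-caption token separately. This is only an illustration: hasCaptionClass, findCaptions and the sample string are made-up names and data, and the sketch assumes caption divs are not nested and hold at most one attribute-free p.

```javascript
// Hypothetical helper: true if the class attribute value contains the token.
function hasCaptionClass(classValue) {
  return classValue.split(/\s+/).includes('carousel-caption');
}

// Hypothetical helper: paragraph text of every carousel-caption div.
// Assumes no nested divs and at most one attribute-free <p> per div.
function findCaptions(html) {
  // Group 1: quote character, group 2: class value, group 3: div body.
  const divRex = /<div\b[^<>]*\bclass=(["'])([^"'<>]*)\1[^<>]*>([\s\S]*?)<\/div>/g;
  const found = [];
  for (const m of html.matchAll(divRex)) {
    if (!hasCaptionClass(m[2])) continue; // some other div, skip it
    const p = /<p>([\s\S]*?)<\/p>/.exec(m[3]);
    if (p) found.push(p[1].trim());
  }
  return found;
}

// Made-up sample input, for illustration only.
const sample =
  '<div class="d-none d-none d-none"> <p>not a caption</p></div>\n' +
  "<div id='x' class='d-md-block carousel-caption'>\n <p>caption A</p>\n</div>\n" +
  '<div class="carousel-caption"><p>caption B</p></div>';

console.log(findCaptions(sample)); // [ 'caption A', 'caption B' ]
```

Splitting the class value into tokens keeps the check independent of how many other classes the div carries and of their order.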

If you know for sure that the format is exactly the one you posted in your question, I suggest using substring and indexOf.

let str = 
  '<p>some other text .....  </p>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>some text .....  </p></div>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>some text .....  </p></div>\n' + 
  '<div class="carousel-caption d-none d-md-block"> <p>some text .....  </p></div>\n' + 
  '<p>some other text .....  </p>';
let search = '<div class="carousel-caption d-none d-md-block"> <p>';
let offset = 0;
let pos;

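// indexOf returns -1 when nothing is found; 0 is a valid position.
while ((pos = str.indexOf(search, offset)) !== -1) {
  let end = str.indexOf("</p>", pos);
  offset = pos + search.length;
  // substring takes start and end indexes (substr is deprecated).
  console.log("Found div at", pos, ", content of p: ", str.substring(offset, end));
}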
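The loop above only finds divs that contain a p. A variation that keeps one entry per div, with an empty string where the caption is missing, might look like the sketch below. The function name captionsByIndexOf is hypothetical, and the sketch assumes the opening tag is always exactly the string shown and that caption divs are not nested.

```javascript
// Hypothetical helper: one array entry per caption div, '' when it has no <p>.
// Assumes the opening tag is exactly `open` and that caption divs are not nested.
function captionsByIndexOf(html) {
  const open = '<div class="carousel-caption d-none d-md-block">';
  const close = '</div>';
  const captions = [];
  let offset = 0;
  let pos;
  while ((pos = html.indexOf(open, offset)) !== -1) {
    const bodyStart = pos + open.length;
    const bodyEnd = html.indexOf(close, bodyStart);
    if (bodyEnd === -1) break; // unterminated div: stop rather than guess
    const body = html.substring(bodyStart, bodyEnd);
    const pStart = body.indexOf('<p>');
    const pEnd = body.indexOf('</p>', pStart + 3);
    // Keep the slot even when there is no paragraph, so the order is preserved.
    captions.push(pStart !== -1 && pEnd !== -1
      ? body.substring(pStart + 3, pEnd).trim()
      : '');
    offset = bodyEnd + close.length;
  }
  return captions;
}

console.log(captionsByIndexOf(
  '<div class="carousel-caption d-none d-md-block"> <p>one</p></div>\n' +
  '<div class="carousel-caption d-none d-md-block">\n\n </div>\n' +
  '<div class="carousel-caption d-none d-md-block"> <p>two</p></div>'
)); // [ 'one', '', 'two' ]
```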
miile7
  • This is working **only** if there are linebreaks between `<div>...</div>` and without linebreaks inside `<p>...</p>`. [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) – Toto May 26 '20 at 13:56

This assumes there are no nested p elements in your paragraph (see the updates below). m[1] contains the content of the paragraph tag.

var m, str = `
<p>some other text .....  </p>
<div class="d-none carousel-caption d-md-block">
   <p>some text 1 .....  </p>
</div>
<div class="carousel-caption d-none d-md-block">
   <p> some text 2 .....  </p>
</div>
<div class="carousel-caption d-none d-md-block">
   <p>  some <span>text 3</span> .....  </p>
</div>
<div class="carousel-caption d-none d-md-block">
</div>
<div class="d-none d-md-block">
   <p>oh-no! missing style class</p>
</div>
<p>some other text .....  </p>
`;
// Flags: g = all matches, s = dot also matches newlines, i = case-insensitive.
const matches = str.matchAll(/<div [^<>]+carousel-caption[^<>]+>\s*(?:<p>)?\s*(.*?)\s*(?:<\/p>)?\s*<\/div>/gsi);
for (const m of matches) {
  console.log("match: '" + m[1] + "'");
}

This generates:

match: 'some text 1 .....'
match: 'some text 2 .....'
match: 'some <span>text 3</span> .....'
match: ''

Update: fixed regex to select only paragraphs inside of divs with class=carousel-caption

Update: changed regex to potentially allow for tags inside the paragraphs, except for other p tags. Please keep in mind that a regex is not an HTML parser and shouldn't be (ab)used as one. This works if the HTML structure is as defined. If your HTML can change in any imaginable way, use an HTML parser instead; a regex one-liner won't do it.

Update: changed regex to also select empty divs with the corresponding style class set.
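Since the question asks for an array that keeps an empty entry for a div without a caption, the matches could be collected as sketched below. This is only a sketch built on the pattern above: collectCaptions and the images list are hypothetical, and the same assumptions about the HTML structure apply.

```javascript
// Hypothetical helper built on the pattern above; '' marks a div without a caption.
function collectCaptions(html) {
  const rex = /<div [^<>]+carousel-caption[^<>]+>\s*(?:<p>)?\s*(.*?)\s*(?:<\/p>)?\s*<\/div>/gsi;
  // matchAll yields the divs in document order, so index i belongs to image i.
  return Array.from(html.matchAll(rex), (m) => m[1]);
}

const images = ['a.jpg', 'b.jpg', 'c.jpg']; // made-up image list, for illustration
const captions = collectCaptions(
  '<div class="carousel-caption d-none d-md-block">\n <p>first</p></div>' +
  '<div class="carousel-caption d-none d-md-block">\n \n </div>' +
  '<div class="carousel-caption d-none d-md-block">\n <p>third</p></div>'
);

// Pair every image with its caption; the second image gets an empty string.
images.forEach((src, i) => console.log(src, '->', JSON.stringify(captions[i])));
// a.jpg -> "first"
// b.jpg -> ""
// c.jpg -> "third"
```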

mrxra