I'm building a scraper for testing purposes (no, I'm not scraping news websites) and I would like to strip all the useless information from the source, such as the extra links and text added at the bottom. That extra content is always preceded by a static paragraph, so I would like to match this paragraph and remove it, along with everything after it.

The static paragraph is always: <p><strong>Static text here</strong></p>

A sample, complete text that I want to catch using regex is:

<p><strong>Static text here</strong></p>
<p><strong>This is a paragraph</strong></p>
<p>This is another, a normal weight one</p>
<p><img src="test.png">Here's an image</p>

Another example could be:

<p><strong>Static text here</strong></p>
<p><img src="test.png">Here's an image</p>
<p>another example</p>

Any ideas? I came up with this regex, but it matches only the first line, whereas I would like to match every line after the static text: https://regex101.com/r/xS873u/2

  • Please use an HTML parser. [Don't use regex to parse HTML](https://stackoverflow.com/a/1732454/4934172). If you're not actually trying to parse HTML, just replace the last `.*` with `[\S\s]*` or use the Single Line flag (i.e., `/gs` instead of `/g`) if that works for you. – 41686d6564 Apr 21 '19 at 13:39
  • Thanks for the useful info @AhmedAbdelhameed, but I'm restricted to PHP's preg_replace function with the isu modifiers. I tried this one but it doesn't work: https://www.phpliveregex.com/p/rNM#tab-preg-replace – JohnnyBratsoni Apr 22 '19 at 09:40
  • Updated with the correct snippet, for everyone who is interested in this mod. – JohnnyBratsoni Apr 22 '19 at 10:32
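
For anyone landing here, a minimal sketch of the single-line-flag suggestion from the comments above. It is written in Python purely for illustration and is not the snippet mentioned in the thread. `re.DOTALL` plays the role of the `s` modifier in PHP's preg_replace and `re.IGNORECASE` the role of `i`. The function name `strip_trailer` and the "Article body" line in the sample are hypothetical; only the static paragraph and the lines after it come from the question.

```python
import re

# Marker paragraph taken from the question; it and everything after it is dropped.
STATIC_PARAGRAPH = "<p><strong>Static text here</strong></p>"

# re.escape keeps the marker literal. With re.DOTALL the trailing .* also
# crosses newlines, so the match runs from the marker to the end of the input.
TRAILER_RE = re.compile(
    re.escape(STATIC_PARAGRAPH) + r".*",
    re.DOTALL | re.IGNORECASE,
)


def strip_trailer(html: str) -> str:
    """Return html with the static paragraph and everything after it removed."""
    return TRAILER_RE.sub("", html, count=1)


# Hypothetical input: one line to keep, followed by the sample from the question.
sample = (
    "<p>Article body that should stay</p>\n"
    "<p><strong>Static text here</strong></p>\n"
    "<p><strong>This is a paragraph</strong></p>\n"
    "<p>This is another, a normal weight one</p>\n"
    '<p><img src="test.png">Here\'s an image</p>\n'
)

print(strip_trailer(sample))
```

Without the dot-all behaviour, `.` stops at the first newline, which is why the original pattern matched only the first line. Input that does not contain the static paragraph is returned unchanged.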

0 Answers