
I need to match the text inside a <p> element, excluding its first <strong> element.

When tested on an online regex tester like https://regexr.com/ it works fine, but when I run it in my Python script, it doesn't match anything.

Regex:

<br>(.*)<\/p>$

Text:

<p><strong>Ackee and Saltfish Fritters</strong><br>\n2 cup salted Cod fish, soaked overnight<br>\n2 cloves garlic<br>\n½ medium onion<br>\n1 tsp thyme<br>\n1 tbsp cilantro<br>\n1 scallion, finely chopped<br>\n¼ scotch bonnet pepper, seeds removed<br>\n1 cup all purpose flour<br>\n1 tsp baking powder<br>\n½ cup ackee<br>\n¾ cup water<br>\nCanola oil</p>

Python code:

re.search(r"<br>(.*)<\/p>$", target_string)

Desired result in match group 1:

\n2 cup salted Cod fish, soaked overnight<br>\n2 cloves garlic<br>\n½ medium onion<br>\n1 tsp thyme<br>\n1 tbsp cilantro<br>\n1 scallion, finely chopped<br>\n¼ scotch bonnet pepper, seeds removed<br>\n1 cup all purpose flour<br>\n1 tsp baking powder<br>\n½ cup ackee<br>\n¾ cup water<br>\nCanola oil
Biswajit Chopdar

2 Answers


Since you have already isolated the one <p> tag you want, you could just use re.sub here to strip off the <strong> element and then unwrap the <p> tags, e.g.

# -*- coding: utf-8 -*-
import re

txt = "<p><strong>Ackee and Saltfish Fritters</strong><br>\n2 cup salted Cod fish, soaked overnight<br>\n2 cloves garlic<br>\n½ medium onion<br>\n1 tsp thyme<br>\n1 tbsp cilantro<br>\n1 scallion, finely chopped<br>\n¼ scotch bonnet pepper, seeds removed<br>\n1 cup all purpose flour<br>\n1 tsp baking powder<br>\n½ cup ackee<br>\n¾ cup water<br>\nCanola oil</p>"
txt = re.sub(r'<p>(.*)</p>', '\\1', re.sub(r'<strong>.*?</strong>', '', txt), flags=re.DOTALL)
print(txt)

This prints:

<br>
2 cup salted Cod fish, soaked overnight<br>
2 cloves garlic<br>
½ medium onion<br>
1 tsp thyme<br>
1 tbsp cilantro<br>
1 scallion, finely chopped<br>
¼ scotch bonnet pepper, seeds removed<br>
1 cup all purpose flour<br>
1 tsp baking powder<br>
½ cup ackee<br>
¾ cup water<br>
Canola oil
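
If the leading <br> should go as well (the desired result in the question starts at the first ingredient line), one possible variation is to remove the <br> that follows the <strong> element in the same substitution. This is only a sketch on a shortened, hypothetical sample string; the \s* and count=1 are assumptions about the markup, not part of the original answer.

```python
import re

# Shortened, hypothetical sample with the same shape as the recipe markup.
txt = "<p><strong>Title</strong><br>\n2 cloves garlic<br>\nCanola oil</p>"

# Drop the first <strong> element together with the <br> right after it.
without_title = re.sub(
    r"<strong>.*?</strong>\s*<br>", "", txt, count=1, flags=re.DOTALL
)

# Unwrap the surrounding <p> element, keeping only its contents.
body = re.sub(r"<p>(.*)</p>", r"\1", without_title, flags=re.DOTALL)

print(repr(body))  # '\n2 cloves garlic<br>\nCanola oil'
```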
Tim Biegeleisen

Although Tim's answer solves the problem, I found out what was causing the regex not to work in my script.

It was the \n character in the text that was causing the issue.

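After removing the \n characters, the regex worked fine. By default, . in Python's re module does not match a newline, so (.*) could not span the line breaks between the first <br> and the closing </p>.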

Posting it if anyone faces the same problem:

import re

# Remove the newlines so that "." can match across the whole paragraph.
text = text.replace("\n", "")
lines = re.search(r"<br>(.*)</p>", text).group(1)
print(lines)

This prints:

2 cup salted Cod fish, soaked overnight<br>2 cloves garlic<br>½ medium onion<br>1 tsp thyme<br>1 tbsp cilantro<br>1 scallion, finely chopped<br>¼ scotch bonnet pepper, seeds removed<br>1 cup all purpose flour<br>1 tsp baking powder<br>½ cup ackee<br>¾ cup water<br>Canola oil
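
An alternative that keeps the newlines is to pass the re.DOTALL flag, which makes . match newline characters too. A minimal sketch, using a shortened sample string for illustration rather than the full recipe:

```python
import re

# Shortened sample with the same structure as the text in the question.
text = (
    "<p><strong>Ackee and Saltfish Fritters</strong><br>\n"
    "2 cup salted Cod fish, soaked overnight<br>\n"
    "2 cloves garlic<br>\n"
    "Canola oil</p>"
)

# Without DOTALL, "." stops at every "\n", so nothing matches.
no_flag = re.search(r"<br>(.*)</p>$", text)
print(no_flag)  # None

# With DOTALL, "." also matches "\n", so the group spans all lines.
match = re.search(r"<br>(.*)</p>$", text, flags=re.DOTALL)
print(repr(match.group(1)))
```

With the flag set, group 1 starts at the newline after the first <br> and ends just before </p>, which is the shape asked for in the question.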
Biswajit Chopdar