
I am writing a function that uses a regular expression to retrieve a string from HTML code.

For example, given <p class = 3DFormText> [Telephone] <o: p> </ o: p> <w:sdtPr></w:sdtPr> </p>, I want to get [Telephone]. In general, the string I want has the form [anything]. I do not know which regular expression pattern or search method to use for this. Can anyone help me write it, or offer any suggestions?

Ganesa Vijayakumar
  • Don't try to parse markup like HTML with regular expressions. They are not sufficient for the job and end up helping very little. What you want for parsing HTML is https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – Daniel Farrell Nov 03 '19 at 16:51
  • Just use BeautifulSoup4 (that's the name of the package on PyPI) – Ahmed I. Elsayed Nov 04 '19 at 01:56
  • As @DanielFarrell said, it's better to use an HTML/XML parser rather than a regex. You could use [Parsel](https://github.com/scrapy/parsel), [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), etc. – reisdev Nov 04 '19 at 01:57
  • Is that actually HTML, or is it XML (e.g. XHTML embedded in an XML-based word processor document of some kind)? – Ry- Nov 04 '19 at 02:04

1 Answer


It is better to use BeautifulSoup4.

You can install it with pip install beautifulsoup4 (pip treats package names case-insensitively; the module itself is imported as bs4).
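As a rough illustration of the parser-first approach, the sketch below extracts the text from the markup first and only then searches that text for [anything]. It rests on some assumptions: it uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without installing anything, and the names TextCollector, BRACKETED, and extract_bracketed are made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: "[", then any run of non-bracket characters, then "]".
BRACKETED = re.compile(r"\[[^\[\]]*\]")


class TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


def extract_bracketed(html):
    """Return every [bracketed] substring found in the text content of html."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.chunks)
    return BRACKETED.findall(text)


# The sample markup from the question.
sample = "<p class = 3DFormText> [Telephone] <o: p> </ o: p> <w:sdtPr></w:sdtPr> </p>"
print(extract_bracketed(sample))
```

With BeautifulSoup, the text-extraction step would instead be a call to get_text() on the parsed document; the bracket pattern would stay the same.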

But if you insist on a regex, try improving this pattern. I just wrote it, so it is certainly not 100% perfect, and it matches only an opening tag:

<[A-Za-z]+\s*(\s*[a-zA-Z0-9]\s*=*"*[A-Za-z0-9\(\)]*"*)*>

It matches <tag ANY="ANY" checked>, and the attributes are optional.

It matched my dummy tag:

<tag required name1="ahmed" name="mohamed" person="idk" whatever="whatever" checked >

Note that I made it accept capitalized attribute names (first letter) only because HTML accepts them; feel free to remove that if you want.
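To try the pattern out, here is a minimal sketch using Python's re module. The pattern is copied unchanged from this answer; the names OPENING_TAG and dummy, and the two extra test strings, are only illustrative. It also shows two of the gaps that would need improving: closing tags and attribute values containing characters such as "." do not match.

```python
import re

# The pattern from this answer, copied as-is.
OPENING_TAG = re.compile(r'<[A-Za-z]+\s*(\s*[a-zA-Z0-9]\s*=*"*[A-Za-z0-9\(\)]*"*)*>')

dummy = '<tag required name1="ahmed" name="mohamed" person="idk" whatever="whatever" checked >'

# The dummy tag matches in full.
print(OPENING_TAG.fullmatch(dummy) is not None)

# A closing tag is rejected, because "<" must be followed directly by a letter.
print(OPENING_TAG.fullmatch("</p>") is not None)

# A value containing "." is rejected, since "." is in none of the character classes.
print(OPENING_TAG.fullmatch('<a href="page.html">') is not None)
```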

Ahmed I. Elsayed