Regex to capture html elements with their class name

Question

I am trying to get the element (tag) name and class name for every element in an HTML file using Python. I managed to get all the class names with the code below. It is written this way because I will be going through a lot of HTML files, storing each element together with its class name and ignoring elements that have no class.

import re

# root and file come from the enclosing directory loop (not shown)
with open(root + "/" + file, "r", encoding="utf-8-sig", errors="ignore") as temp_file:
    temp_content = temp_file.read()
class_names = re.findall(r'class="(.*?)"', temp_content)

However, I am now struggling to find a way to get the element that each class belongs to. Keep in mind that elements sometimes overlap with each other, so readlines() won't help much either, and it would probably be slower than running a regex over the entire document at once.

<div class="header_container container_12">
        <div class="grid_5">
              <h1><a href="#">Logo Text Here</a></h1>
        </div>
        <div class="grid_7">
            <div class="menu_items"> 
                <a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a> 
               <a href="#" 
                class="about">About Me
                </a><a href="#" class="contact">Contact Me</a> 
            </div>
        </div>
</div>

The HTML snippet above is badly indented on purpose, to show the kind of data I am working with. The goal is to store the results in something like a hashmap, e.g.:

"header_container container_12": "div"
"grid_5": "div"
"grid_7": "div"
"menu_items": "div"
"home active": "a"
"portfolio": "a"
"about": "a"
"contact": "a"

[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. — Toto, Feb 19 '20 at 17:19

ggorlen · Accepted Answer · 2020-02-19T16:59:34.597

Regex is a poor choice for HTML parsing, but luckily this is trivial with BeautifulSoup:

from bs4 import BeautifulSoup

html = """<div class="header_container container_12">
        <div class="grid_5">
              <h1><a href="#">Logo Text Here</a></h1>
        </div>
        <div class="grid_7">
            <div class="menu_items"> 
                <a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a> 
               <a href="#" 
                class="about">About Me
                </a><a href="#" class="contact">Contact Me</a> 
            </div>
        </div>
</div>"""

for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    print(elem.attrs["class"], elem.name)

Output:

['header_container', 'container_12'] div
['grid_5'] div
['grid_7'] div
['menu_items'] div
['home', 'active'] a
['portfolio'] a
['about'] a
['contact'] a

You can put this into a dict if you like, but be careful: several elements may share the same key, and each later one overwrites the earlier entry. All the dict tells you is that at least one element with that tag name exists for a given tuple of class names in that specific order.

elems = {}

for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    elems[tuple(elem.attrs["class"])] = elem.name

for k, v in elems.items():
    print(k, v)
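
Because a plain dict keeps only one tag name per key, one possible variation is to collect a set of tag names for each class string. The following is only a sketch, not part of the answer's code: the name `tags_by_class` and the extra `<span>` in the sample HTML are made up for illustration, and `"html.parser"` is used so that lxml does not need to be installed.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample: the <span> is added only to show two different
# tags sharing the same class string.
html = """<div class="menu_items">
    <a href="#" class="home active">Home</a>
    <span class="home active">Same classes, different tag</span>
    <a href="#"
       class="about">About Me</a>
</div>"""

# Maps a class string such as "home active" to every tag name seen with it.
tags_by_class = defaultdict(set)

for elem in BeautifulSoup(html, "html.parser").find_all(attrs={"class": True}):
    # elem["class"] is a list of class names; join it back into one string.
    tags_by_class[" ".join(elem["class"])].add(elem.name)

for class_string, tag_names in tags_by_class.items():
    print(class_string, sorted(tag_names))
```

Joining the class list keeps the keys in the `"home active": "a"` shape from the question; use `tuple(elem["class"])` instead if the individual class names should stay separate.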

Thank you for the solution, had to switch to "html.parser" as it was failing to find "lxml". As a side question, would there be an easy way to get what elements are under what using beautiful soup? So in the html above get the "header_container" as the parent of all other elements and "menu_items" parent of the "a" elements. — Just_A.Technicality, Feb 19 '20 at 18:33
Sure, see [finding elements by class name](https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class) and [finding children of a node](https://stackoverflow.com/a/15892793/6243352). — ggorlen, Feb 19 '20 at 18:42
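
For the parent/child side question, a rough sketch of one approach with BeautifulSoup follows. It assumes "parent" means the nearest ancestor that has a class attribute; the names `parent_of` and `child_tags` are invented for this example, and the HTML is a shortened version of the snippet from the question.

```python
from bs4 import BeautifulSoup

html = """<div class="header_container container_12">
    <div class="grid_7">
        <div class="menu_items">
            <a href="#" class="home active">Home</a>
            <a href="#" class="about">About Me</a>
        </div>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Map each element's class string to the class string of its nearest
# ancestor that also has a class attribute (None for the outermost one).
parent_of = {}
for elem in soup.find_all(attrs={"class": True}):
    parent = elem.find_parent(attrs={"class": True})
    key = " ".join(elem["class"])
    parent_of[key] = " ".join(parent["class"]) if parent is not None else None

# Direct children of a given element: recursive=False stops at one level.
menu = soup.find(class_="menu_items")
child_tags = [child.name for child in menu.find_all(recursive=False)]

print(parent_of)
print(child_tags)
```

To treat "header_container" as the parent of everything beneath it, `elem.find_parents(attrs={"class": True})` returns all matching ancestors rather than only the nearest one.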

score 0 · Answer 2 · answered Feb 19 '20 at 16:53

I think regex is the wrong tool for the job here. Consider loading your HTML into a DOM document and querying it with DOM selectors instead.

The following example is JavaScript, because that allows me to include it as a runnable snippet, but it should explain the approach well enough for you to write the Python equivalent.

var classElements = document.querySelectorAll("[class]");

for (var i = 0; i < classElements.length; i++)
{
 console.log(classElements[i].className + ": " + classElements[i].tagName);
}

<div class="header_container container_12">
        <div class="grid_5">
              <h1><a href="#">Logo Text Here</a></h1>
        </div>
        <div class="grid_7">
            <div class="menu_items"> 
                <a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a> 
               <a href="#" 
                class="about">About Me
                </a><a href="#" class="contact">Contact Me</a> 
            </div>
        </div>
</div>
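
A Python equivalent of the selector approach might look like the sketch below. It assumes BeautifulSoup (bs4) is installed, which this answer did not specify, and it uses a shortened copy of the HTML. `select("[class]")` takes the same CSS attribute selector as `querySelectorAll("[class]")`; the `pairs` list is just an illustrative name.

```python
from bs4 import BeautifulSoup

html = """<div class="header_container container_12">
    <h1><a href="#">Logo Text Here</a></h1>
    <div class="menu_items">
        <a href="#" class="home active">Home</a>
        <a href="#"
           class="about">About Me</a>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# "[class]" matches every element that has a class attribute,
# in document order, however the markup is indented or wrapped.
pairs = [(" ".join(elem["class"]), elem.name) for elem in soup.select("[class]")]

for class_name, tag_name in pairs:
    print(f"{class_name}: {tag_name}")
```

One difference from the JavaScript version: `tagName` in the browser is upper case (`DIV`), while `elem.name` in BeautifulSoup is lower case (`div`).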
