I am trying to get the element and class name for all elements within an HTML file using Python. I managed to get all class names with the code below. It is written this way because I will go through a lot of HTML files, storing elements together with their class names and ignoring elements without a class name.
import re

# root and file are defined elsewhere in my script
with open(root + "/" + file, "r", encoding="utf-8-sig", errors="ignore") as temp_file:
    temp_content = temp_file.read()
class_names = re.findall(r'class="(.*?)"', temp_content)
However, I am now struggling to find a way to get the element that each class belongs to. Keep in mind that elements sometimes overlap with each other (several tags on one line, or one tag split across lines), so readlines() won't help much either, and it would probably be slower than running a regex over the entire document at once.
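One direction that might work, as a minimal sketch rather than a definitive solution: instead of a regex, the standard library's html.parser reports every start tag together with its attributes, including tags whose attributes are split across lines. The names ClassCollector and collect_classes are hypothetical, chosen only for illustration, and the sketch assumes that when the same class string appears on different tags, keeping the last one seen is acceptable.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: maps each class attribute value to its tag name."""

    def __init__(self):
        super().__init__()
        self.class_to_tag = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "class" and value:
                # Assumption: a later tag with the same class string overwrites an earlier one.
                self.class_to_tag[value] = tag


def collect_classes(html_text):
    """Return a dict of class string -> tag name; elements without a class are skipped."""
    parser = ClassCollector()
    parser.feed(html_text)
    parser.close()
    return parser.class_to_tag


# Small made-up sample with a tag split across two lines.
sample = """<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#"
class="about">About Me
</a>
</div>
</div>"""

result = collect_classes(sample)
```

In the file loop, collect_classes(temp_content) could replace the re.findall call, with the returned dicts merged per file as needed.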
<div class="header_container container_12">
<div class="grid_5">
<h1><a href="#">Logo Text Here</a></h1>
</div>
<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a>
<a href="#"
class="about">About Me
</a><a href="#" class="contact">Contact Me</a>
</div>
</div>
</div>
The above HTML snippet is badly indented on purpose, to showcase the kind of data I am working with. The goal would be to store each class name with its element, maybe in a hashmap, e.g.:
"header_container container_12": "div"
"grid_5": "div"
"grid_7": "div"
"menu_items": "div"
"home active": "a"
"portfolio": "a"
"about": "a"
"contact": "a"