I am working on a web-page classification project, for which I want to extract the visible text, images and SVGs of a page (including elements outside the viewport). I am having trouble accurately determining whether an element is actually visible.
I have searched extensively for the possible ways an element can be hidden, but unfortunately have not been successful.
My current code looks like this:
// An element counts as hidden if at least one of the causes below applies.
var isHidden = el => {
    return Object.values(potentialHiddenCauses(el)).filter(x => x).length > 0;
};

// Returns an object of named checks; each value is true when that check fires.
var potentialHiddenCauses = el => {
    var style = window.getComputedStyle(el);
    var rect = el.getBoundingClientRect();
    // true if the element has a direct text-node child (nodeType 3) with non-whitespace content
    var hasNodeWithVisibleText = Array.from(el.childNodes).some(x => x.nodeType == 3 && x.nodeValue.replace(/\s/g, "").length > 0);
    var data = {
        // a: !el.offsetParent, // can be a false positive?
        b: style.display == "none",
        c: style.opacity == 0,
        d: style.visibility == "hidden",
        e: el.offsetWidth == 0,
        f: el.offsetHeight == 0,
        i: rect.width == 0,
        j: rect.height == 0,
        k: rect.x < 0,
        l: rect.x > document.documentElement.scrollWidth,
        m: rect.y < 0,
        n: rect.y > document.documentElement.scrollHeight,
        o: hasNodeWithVisibleText && style.fontSize == "0px",
        p: el.tagName.toLowerCase() == "img" && !el.src
    };
    return data;
};
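A note on checks k to n: getBoundingClientRect() returns coordinates relative to the viewport, while scrollWidth and scrollHeight describe the whole document, so those comparisons depend on the current scroll position. Below is a minimal sketch of converting the rect to document coordinates first. The name offDocumentCauses and its parameters are hypothetical and not part of the code above. Unlike checks k and m, it only flags elements that lie entirely outside the document.

```javascript
// Hypothetical helper: classify a viewport-relative rect against the full document.
// rect:   {left, top, right, bottom}, e.g. from el.getBoundingClientRect()
// scroll: {x, y}, e.g. window.scrollX and window.scrollY
// doc:    {width, height}, e.g. documentElement.scrollWidth and scrollHeight
function offDocumentCauses(rect, scroll, doc) {
  // Shift the viewport-relative edges by the scroll offset to get document coordinates.
  const left = rect.left + scroll.x;
  const top = rect.top + scroll.y;
  const right = rect.right + scroll.x;
  const bottom = rect.bottom + scroll.y;
  return {
    leftOfDocument: right <= 0,
    rightOfDocument: left >= doc.width,
    aboveDocument: bottom <= 0,
    belowDocument: top >= doc.height,
  };
}
```

In a browser it could be called as offDocumentCauses(el.getBoundingClientRect(), {x: window.scrollX, y: window.scrollY}, {width: document.documentElement.scrollWidth, height: document.documentElement.scrollHeight}).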
I previously also checked element.style properties, but to my understanding they only reflect inline styles and so may not be accurate, while getComputedStyle should be.
Every time I think my check is solid, another edge case turns up, and I have to restart the entire data collection process.
My latest problem is on https://www.jbhifi.com.au/bose/bose-quietcomfort-35-ii-wireless-over-ear-headphones-black/505852/, where text and images in the dropdown menu are considered visible even though they are not displayed.
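Which mechanism this particular menu uses is not established here. One general gap in per-element checks is that opacity is not inherited, so an element inside an ancestor with opacity: 0 still reports a computed opacity of 1. Similarly, content inside a collapsed ancestor that clips its overflow can keep a non-zero rect. Below is a rough sketch of an ancestor walk. The name findHidingAncestor and the injectable getStyle parameter are hypothetical, and the clipping test is only an approximation (for example, it ignores absolutely positioned descendants that escape the clipping box).

```javascript
// Hypothetical helper: walk up the ancestor chain and report the first
// ancestor that would hide its descendants, or null if none is found.
// getStyle is injectable so the logic can be exercised without a browser.
function findHidingAncestor(el, getStyle = node => window.getComputedStyle(node)) {
  for (let node = el.parentElement; node; node = node.parentElement) {
    const style = getStyle(node);
    // opacity is not inherited, so a descendant's own computed opacity stays "1"
    if (parseFloat(style.opacity) === 0) {
      return { node, cause: "opacity" };
    }
    // Approximation: a zero-sized box that clips overflow hides its content,
    // even though the content itself keeps a non-zero bounding rect.
    const collapsed = node.clientWidth === 0 || node.clientHeight === 0;
    const clips = style.overflowX !== "visible" || style.overflowY !== "visible";
    if (collapsed && clips) {
      return { node, cause: "clipped" };
    }
  }
  return null;
}
```

An element could then be treated as hidden when either its own checks fire or findHidingAncestor(el) returns a non-null result.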
It would be great if someone could tell me about missing checks or flaws in my approach.