
I have an HTML file with vendor information that I want to scrape. The picture is just an example from the first page.

I noticed that the elements don't all have the same format.

I expected all the names to be under "c x27 y3b w3b hb":

<div class="c x27 y3b w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 1</div></div>

But they're not. The next vendor name is in "c x27 y53 w3b hb" instead of "c x27 y3b w3b hb":

<div class="c x27 y53 w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 2</div></div>

How can I scrape the information if the HTML elements aren't organized consistently?

baduker
Oliver Bird
  • 1
    Uh, Is this real PII data you're posting on a public website? – Blorgbeard Apr 01 '19 at 21:16
  • I'm not sure, but I believe this is information users could find online. Just in case, I removed the picture. – Oliver Bird Apr 01 '19 at 21:19
  • 1
    What is `c x27 y3b w3b hb`? Is it a CSS class or what? – furas Apr 01 '19 at 21:26
  • If the HTML is unorganized then you may have to use many `if/else`, or try to find some other way to recognize an item, e.g. `first SPAN in second TR in fifth DIV`, or you can use `regex`. I'm not sure, but BS probably has a function to search with `regex`. Or maybe you should search only for `c x27 w3b hb`, without `y3b` or `y53`. If these are CSS classes then it should be possible to search only for `c x27 w3b hb`. – furas Apr 01 '19 at 21:27
  • @furas It's the value of the `class` attribute. – Blorgbeard Apr 01 '19 at 21:30
  • 1
    So the elements all have multiple classes, and they don't share the exact same set.. is there a set of classes that they do all share and that no other elements also share? – Blorgbeard Apr 01 '19 at 21:31
  • 2
    See: https://stackoverflow.com/a/22284921 - you could do something like `find_all("div", class_ = "c x27 w3b hb")`, just excluding `y3b`/`y53`. – Blorgbeard Apr 01 '19 at 21:34
  • As far as I can see, all vendors are in `t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0`, but if that is not enough then all vendors are in `c x27 w3b hb`. It is not one class but many classes (in one string), and you don't have to search for all of them. You can also find the element with `c x27 w3b hb` and search inside it for `t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0` or `div`. With BS you can build complex rules. – furas Apr 01 '19 at 21:42
  • @furas I got what you meant. Could you provide the keywords that would show me how to build complex rules in BS? Or would it be the one Blorgbeard provided? – Oliver Bird Apr 01 '19 at 21:51
  • For example, I use a class from a div to find a cell, but that class also matches other information. Is there an approach to solve that? – Oliver Bird Apr 01 '19 at 22:07
  • @OliverBird The example from @Blorgbeard should resolve your problem, but if you have a more complex problem then you can use `find()` and `find_all()` several times. First use `find_all()` to find the items with classes `c x27 w3b hb`, which gives you a list of items. Then use a `for` loop to call `find_all()` or `find()` on every item in the list, and so on. You can also use `find_all()` after `find()` - `soup.find(..).find_all(..)` or `soup.find(..).find(..).find_all()`. You can also use a regex, `find_all(re.compile("^b"))`, or `.children`, etc. - see the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – furas Apr 01 '19 at 22:08
  • @furas Thank you for the direction. I know what I can do next. – Oliver Bird Apr 01 '19 at 22:13
  • 1
    You don't have to use only one `find_all()` to find a cell. First you can use `find_all()` (or `find()`) to find its parent, and then search for your cell inside the parent. Some elements may have an `id` or `data` attribute which helps to find them. – furas Apr 01 '19 at 22:14
  • 2
    BTW: sometimes you find many items but your element is always the third on the list, so you can use an index - `find_all(..)[2]` – furas Apr 01 '19 at 22:19
  • It would help to see the whole HTML file. It would then be easier to spot the actual structure. You could use something like pastebin to post a link to the file. – Martin Evans Apr 04 '19 at 08:59
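
The suggestion in the comments to match only the shared classes can be sketched roughly as follows. This assumes the two snippets from the question are representative of the whole file and that no unrelated element also carries all of `c`, `x27`, `w3b` and `hb`. One caveat: when a string containing several classes is passed to `class_`, Beautiful Soup compares it against the exact attribute value, so simply dropping `y3b`/`y53` from that string finds nothing; a CSS selector via `select()` matches on the subset instead.

```python
from bs4 import BeautifulSoup

# The two snippets from the question, used here as stand-in input.
html = """
<div class="c x27 y3b w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 1</div></div>
<div class="c x27 y53 w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 2</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A multi-class string given to class_ is compared with the exact
# attribute value, so leaving out y3b/y53 matches nothing here.
exact = soup.find_all("div", class_="c x27 w3b hb")

# A CSS selector names only the shared classes and ignores extra ones.
cells = soup.select("div.c.x27.w3b.hb")
names = [cell.get_text(strip=True) for cell in cells]

print(len(exact))
print(names)
```

If other elements turn out to share those four classes, the selector would need to be narrowed, for example by also requiring the inner `div`.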

0 Answers
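
The other idea from the comments, finding the parent first and then searching inside each result, might look like the sketch below. The helper `is_vendor_cell`, the `SHARED` set, and the third "decoy" `div` are hypothetical additions for illustration; they are not taken from the real file.

```python
from bs4 import BeautifulSoup

# The two snippets from the question plus one made-up decoy element
# (hypothetical) that has a different x-class and should be skipped.
html = """
<div class="c x27 y3b w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 1</div></div>
<div class="c x27 y53 w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Vendor 2</div></div>
<div class="c x30 y3b w3b hb">
<div class="t m0 x8 hd ya ff9 fs4 fc1 sc0 ls0 ws0">Not a vendor</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Classes every vendor cell is assumed to share (the y.. class varies).
SHARED = {"c", "x27", "w3b", "hb"}


def is_vendor_cell(tag):
    """True for a div whose class list contains all shared classes."""
    return tag.name == "div" and SHARED <= set(tag.get("class", []))


names = []
# Step 1: find the parent cells. Step 2: search inside each one.
for cell in soup.find_all(is_vendor_cell):
    inner = cell.find("div", class_="t")
    if inner is not None:
        names.append(inner.get_text(strip=True))

print(names)
```

Passing a function to `find_all()` keeps the matching rule in one place, so further conditions (an `id`, a `data` attribute, a position in the list) can be added there if the real file needs them.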