
I am looking to extract data from all the "Reactions" in the webpage, http://www.genome.jp/dbget-bin/www_bget?cpd:C10453

When executed, the code should get the data from the fields Name, Formula, Reaction, and Pathway. Next, it should open each of the 3 reactions and collect the data from the fields Name, Definition, and Reaction class.

I tried using Beautiful Soup but could not work out how to extract the data, as the fields have no specific class in the HTML.

  • Welcome to Stack Overflow! Please update your question to show what you have already tried in a [minimal, complete, and verifiable example](https://meta.stackoverflow.com/questions/261592) and add sample input and expected output. For further information, please see [how to ask good questions](https://stackoverflow.com/help/how-to-ask), and take the [tour of the site](https://stackoverflow.com/tour) :) – Gilles Quenot Mar 26 '18 at 21:28

1 Answer


I assume you have inspected the element on the webpage and noticed that the reaction table cell has the class td21. Assuming that every page is structured like this and you use BS3 or BS4, you should be able to do something like

# get all elements with class td21, take the first, take every link in it (BS4 syntax)
links = soup.find_all("td", class_="td21")[0].find_all("a")

to get the link elements (warning: the syntax differs between BS3 and BS4!). Have a look at the references for further information.

With these links, you can extract the href attribute of each one, start a new HTTP request for it, and parse the result with BS again.
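
As a rough sketch of that follow-up step, assuming BS4 and Python 3: the hrefs are likely relative, so they are resolved against the page URL with `urljoin` before being requested. `SAMPLE_HTML` is a hypothetical stand-in for the real page, the reaction IDs in it are placeholders, and `reaction_urls` and `fetch_soup` are illustrative names, not part of any library.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = "http://www.genome.jp/dbget-bin/www_bget?cpd:C10453"

# Hypothetical stand-in for the real page; the reaction IDs are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th class="th21">Reaction</th>
      <td class="td21"><a href="/dbget-bin/www_bget?rn:R00001">R00001</a>
                       <a href="/dbget-bin/www_bget?rn:R00002">R00002</a></td></tr>
</table>
"""


def reaction_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every link in the first td21 cell."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", class_="td21")
    if cell is None:
        return []
    # urljoin turns a relative href into an absolute URL.
    return [urljoin(base_url, a["href"]) for a in cell.find_all("a", href=True)]


def fetch_soup(url):
    """Download one page and parse it (performs a real HTTP request)."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


urls = reaction_urls(SAMPLE_HTML)
print(urls)
```

On the live site you would pass the downloaded page to `reaction_urls` and then call `fetch_soup(url)` for each result; the sample above only exercises the link extraction, so it runs without network access.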


References:

how-to-find-elements-by-class

searching-by-css-class

nasskalte.juni
  • The assumption is wrong. Other web pages have class td20. Is it possible to extract the class name based on the value? For instance, the "Reaction" header, which has class th21 on the web page http://www.genome.jp/dbget-bin/www_bget?cpd:C10453 – Hemachandra Ghanta Mar 30 '18 at 22:27
  • That seems to make it a bit more complicated. In this case, I would assume that 'Reaction' is always somewhere between td15 and td25, .strip() the contents (.text) of these elements, and check whether they match 'Reaction' (there must be at least this word, or a list of unique words, that identifies the row). Then get the next element with the .find_next_sibling() or .find_next() method, assuming that the links are always next to the Reaction title. – nasskalte.juni Apr 02 '18 at 18:58
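
The text-matching idea from the last comment could look roughly like the sketch below, assuming BS4. Instead of guessing class numbers, it scans the header cells for the label text and takes the neighbouring cell. It assumes the label sits in a `th` element directly followed by a `td` in the same row; `SAMPLE_HTML` and `links_for_label` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; class numbers may differ per page.
SAMPLE_HTML = """
<table>
  <tr><th class="th20">Name</th><td class="td20">Example compound</td></tr>
  <tr><th class="th21">Reaction</th>
      <td class="td21"><a href="/dbget-bin/www_bget?rn:R00001">R00001</a></td></tr>
</table>
"""


def links_for_label(html, label):
    """Find the header cell whose text equals `label`; return links in the next td."""
    soup = BeautifulSoup(html, "html.parser")
    for header in soup.find_all("th"):
        # get_text(strip=True) ignores surrounding whitespace and nested tags.
        if header.get_text(strip=True) == label:
            cell = header.find_next_sibling("td")
            return cell.find_all("a") if cell is not None else []
    return []


links = links_for_label(SAMPLE_HTML, "Reaction")
hrefs = [a["href"] for a in links]
print(hrefs)
```

Because the lookup depends only on the visible label, it should keep working when the class is td20 on one page and td21 on another, as long as the row layout stays the same.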