I'm working on a spare-time project and have hit a problem getting data from a webpage into my program.

This is my current code:

import urllib
import re

htmlfile = urllib.urlopen("http://www.superliga.dk/klub/aab?sub=squad")

htmltext = htmlfile.read()

regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'

pattern = re.compile(regex)

goal = re.findall(pattern,htmltext)

print goal

It works fine except for this part:

regex = r'<div data-reactid=".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0">([^<]*)</div>'

I can't get it to display all the values on the webpage that have this reactid, and I can't find a solution to the problem. Any suggestions on how I can get Python to print them?

OneCricketeer
MaltheB
  • Have you tried an actual html parser? – OneCricketeer Oct 07 '16 at 13:40
  • What cricket said above; this would be 100x easier with some kind of parser or scraper. Check out [this link](http://docs.python-guide.org/en/latest/scenarios/scrape/) for an example. – Dillanm Oct 07 '16 at 13:42

1 Answer

You are trying to match a tag you saw in your browser's developer console, right? Unfortunately, the HTML you saw there is only the "final form" of a dynamic page. What you downloaded with urlopen is just the skeleton of the webpage, which the browser then fills in dynamically with other elements, using JavaScript and data fetched from some backend server.

If you print the actual value stored in htmltext, you will find nothing like what you are trying to match with the regex, because the download skips all the further processing normally performed by the JavaScript.
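
A minimal sketch of that check, written for Python 3 (the question's code is Python 2). The two HTML strings are made-up stand-ins for the downloaded skeleton and the browser-rendered page, not real output from the site, and find_div_text is a hypothetical helper. As a side note, the reactid contains `.` and `$`, which are regex metacharacters, so the sketch passes it through re.escape; left unescaped, the `$` asserts end-of-string and the pattern cannot match that literal text at all.

```python
import re

# Attribute value copied from the question; '.' and '$' are regex metacharacters.
REACT_ID = ".3.$squad content.0.$=11:0.0.0.0.1:0.2.0.0"


def find_div_text(html, react_id):
    """Return the text of every <div> whose data-reactid equals react_id."""
    pattern = re.compile(
        r'<div data-reactid="' + re.escape(react_id) + r'">([^<]*)</div>'
    )
    return pattern.findall(html)


# Hypothetical stand-ins (assumptions, not real page content):
# what urlopen might return versus what the browser shows after JavaScript ran.
skeleton_html = '<html><body><div id="app"></div></body></html>'
rendered_html = '<div data-reactid="' + REACT_ID + '">5</div>'

print(find_div_text(skeleton_html, REACT_ID))  # the skeleton has no such tag
print(find_div_text(rendered_html, REACT_ID))  # the rendered page does
```

Running find_div_text on the real htmltext should show the same empty result as the skeleton here, which is the symptom described above.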

What you can try is to monitor the fetched resources (through the dev console) and reverse-engineer the API call to recover the desired info. Chances are the response of these API calls is in JSON format, or has a structure that is much easier to parse than the HTML body.

UPDATE: for example, in Chrome's dev tools I can see async calls like:

http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players?sortBy=eventsStats.goals&limit=5&skip=0&positionId=&q=&seasonId=10392&teamId[]=8470

Maybe this returns the info you are looking for.
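
A rough Python 3 sketch of calling such an endpoint directly, using only the standard library. The base URL and query parameters are copied from the call above; build_url and fetch_json are hypothetical helper names, and the sketch assumes (without having verified it) that the endpoint still exists and answers with JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL and parameters as seen in the dev tools call quoted above.
BASE_URL = "http://ss2.tjekscores.dk/pro-stats/tournaments/46/top-players"
PARAMS = [
    ("sortBy", "eventsStats.goals"),
    ("limit", 5),
    ("skip", 0),
    ("positionId", ""),
    ("q", ""),
    ("seasonId", 10392),
    ("teamId[]", 8470),
]


def build_url(base_url, params):
    """Append the query string; safe="[]" keeps the brackets in 'teamId[]'."""
    return base_url + "?" + urlencode(params, safe="[]")


def fetch_json(url, timeout=10):
    """Download url and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URL is built here; no network request is made by this line.
print(build_url(BASE_URL, PARAMS))
```

Calling fetch_json(build_url(BASE_URL, PARAMS)) would then return whatever the server sends back. The shape of that response is not known here, so print it and inspect it before picking out any fields.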

Cavaz
  • Thank you so much for the feedback, I will look into that! I'm new to Python, so I don't really know what I'm doing, but I'm getting there, hopefully! :D – MaltheB Oct 07 '16 at 19:50
  • I'm also using Sublime Text 3 for programming at the moment, but I would really like a different editor for Windows 10, if you have any suggestions! :) – MaltheB Oct 07 '16 at 19:54
  • We are getting off-topic, so I'll point you to the related question: http://stackoverflow.com/a/81609/1029516. Still, if I'm allowed to express a personal preference without opening a flame war, I'd say "PyCharm" and "don't underestimate Sublime Text"! – Cavaz Oct 07 '16 at 20:22
  • My problem with Sublime Text is that I have no packages installed, so I run into trouble with a lot of different libraries, which makes programming a pain in the bum. – MaltheB Oct 07 '16 at 20:48