
I need to check a web page's search results and compare them to user input.

import urllib
from bs4 import BeautifulSoup

ui = raw_input()  # for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica = urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty

Since there are 1502 results, my idea was to change k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that the names are the text after TEXT, so how do I do it? Maybe using a regex? The second part: if there are matching results, I need to get the link of each result. Again, I know that the link is inside href="", but how do I get it out and make it usable?

  • The server claims the codec is UTF-8 instead. – Martijn Pieters Feb 03 '14 at 14:31
  • It is unclear what you are trying to do; the page loads just fine, uses UTF-8, BeautifulSoup can parse it just fine and as usual all text results from searches in the soup are Unicode objects. Are you working on Windows and trying to *print* the results perhaps? – Martijn Pieters Feb 03 '14 at 14:33
  • If you are trying to print the results to your console and getting an error **encoding** the unicode, then see [Python, Unicode, and the Windows console](http://stackoverflow.com/q/5419) – Martijn Pieters Feb 03 '14 at 14:33
  • @martijnPieters yes, I'm trying to print them. But I don't know what to do with the printed results. My task is to check whether, for example, professor Niels Bohr shows up in the search results. Can you give me a code example? – user3263951 Feb 03 '14 at 14:44
  • Can you update your question to clarify that? You appear to have two issues, and you need to keep those separate. One is that printing Unicode to the Windows console is problematic (see the other post); the other is that you have a set of pages you need to parse and you don't know how to. – Martijn Pieters Feb 03 '14 at 14:46
  • Do you know the structure of the page? What are your intentions? Just check for the presence? Get more information about the professor? – Rod Feb 03 '14 at 14:49
  • @martijnPieters updated – user3263951 Feb 03 '14 at 15:03
  • @user3263951: you are printing the Unicode value to the console, which is what gives you your encoding exception. I've posted a solution below. – Martijn Pieters Feb 03 '14 at 15:06

1 Answer


Checking whether Niels Bohr is listed is as easy as requesting a very large batch size and loading the resulting page:

import sys
import urllib2

from bs4 import BeautifulSoup

# k is the batch size; sys.maxint asks for every result on a single page
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# result links whose text is exactly the name
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link

This prints every link whose link text is exactly 'Bohr, Niels'. You can use a regular expression instead of the string if you need a partial match.
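
As a rough sketch of the partial match, a compiled pattern can be passed in place of the exact string. The snippet below parses a small inline sample rather than the live site; the sample markup, its href values, and the helper name find_partial_matches are hypothetical, and only the AllWordsTextHit class name comes from the code above. It uses the string argument, which recent BeautifulSoup releases accept as the newer name for text.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real result page may be structured differently.
SAMPLE_HTML = u"""
<div>
  <a class="AllWordsTextHit" href="entry-1.html">Bohr, Niels</a>
  <a class="AllWordsTextHit" href="entry-2.html">Bohr, Aage Niels</a>
  <a class="AllWordsTextHit" href="entry-3.html">Planck, Max</a>
</div>
"""


def find_partial_matches(html, fragment):
    """Return result links whose text contains fragment, ignoring case."""
    soup = BeautifulSoup(html, "html.parser")
    # re.escape keeps characters such as "," or "." in the input literal
    pattern = re.compile(re.escape(fragment), re.IGNORECASE | re.UNICODE)
    return soup.find_all(class_="AllWordsTextHit", string=pattern)


matches = find_partial_matches(SAMPLE_HTML, u"bohr")
for link in matches:
    print("%s -> %s" % (link.get_text(), link["href"]))
```

Escaping the user input is a deliberate choice here: it treats the typed name as plain text, so a stray character in the input cannot change the meaning of the pattern.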

The link object has a (relative) href attribute you can then use to load the next page:

professor_page = 'http://www.enciklopedija.hr/' + link['href']
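
Plain concatenation works as long as every href is relative to the site root. A slightly more defensive sketch, assuming nothing about how the site writes its links, resolves the href against the URL of the search page with urljoin from the standard library. The helper name absolute_url and the example href values are hypothetical.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# URL of the search page the links were scraped from
SEARCH_URL = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k=10"


def absolute_url(href, base=SEARCH_URL):
    """Resolve a scraped href against the page it was found on."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only
relative = absolute_url("example-entry.aspx?id=1")
rooted = absolute_url("/example-entry.aspx?id=1")
already_absolute = absolute_url("http://example.com/other")
```

With this approach, relative, root-relative, and already absolute href values all come out as complete URLs that can be passed to urlopen.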
Martijn Pieters
  • Thank you! However, this only works for names without special characters like čć. I tried doing the search with u"Aganović, Ibrahim" and I get an error: "Encoding file module.py using "ascii" encoding will result in information loss. Do you want to continue?" – user3263951 Feb 03 '14 at 20:53
  • @user3263951: Works for me; are you certain that your source code encoding is correct? See http://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code – Martijn Pieters Feb 03 '14 at 20:55
  • what do you think my source encoding should be? I'm still not getting the result. – user3263951 Feb 04 '14 at 16:43
  • @user3263951: I have *no* idea what editor you are using, platform, etc. That's something you'll have to figure out on your own. – Martijn Pieters Feb 04 '14 at 16:44
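
For reference, a minimal sketch of what a correct source encoding looks like in practice, assuming the editor can be set to save the file as UTF-8 (how to do that depends on the editor and is not covered here). The declaration tells Python 2 how to decode non-ASCII characters in the source file; it only helps if the bytes on disk really are UTF-8.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file,
# and the file itself must actually be saved as UTF-8.

name = u"Aganović, Ibrahim"

# repr() shows the code points without depending on the console encoding
print(repr(name))
```

If the name decodes correctly, the ć appears as a single character (code point U+0107) rather than as two garbled ones.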