
I am using Python 2.7.8. I don't know what happened: everything was working, but suddenly this error appeared. I don't understand what it means. I have searched a lot but failed to resolve it.

The full error is:

IOError: [Errno 2] The system cannot find the path specified: '\\settings\\ads\\preferences?hl=en'

Here is my code:

#!/usr/bin/env python
import re
import requests

import urllib
from bs4 import BeautifulSoup

def addtoindex(self, url, soup):
    if self.isindexed(url): return
    print 'Indexing ' + url
    # Get the individual words
    text = self.getTtextonly(url)
    #print 't',text
    words = self.separatewords(text)
    #print 'words',words
    if stem: words = pracstem.stem(words)
    # Get the URL id
    urlid = self.getentryid('googleurllist', 'url', url)
    #print 'id',urlid
    # Link each word to this url
    for i in range(len(words)):
        word = words[i]
        #print 'w',word
        if word in ignorewords: continue
        wordid = self.getentryid('googlewordlist', 'word', word)
        #print 'wordid',wordid

        self.con.execute("insert into googlewordlocation(urlid, wordid, location) values('{0}', '{1}', '{2}')".format(urlid, wordid, i))
        self.con.commit()


def getTtextonly(self, soup):
    url = soup  # despite its name, this argument is a URL string
    #url = "http://www.cplusplus.com/doc/tutorial/program_structure/"
    html = urllib.urlopen(url).read()  # the traceback points here
    soup = BeautifulSoup(html)

    # kill all script and style elements
    for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
        script.extract()    # rip it out

    text = soup.findAll(text=True)
    return text


def findfromGoogle(self, a):
    page = requests.get("https://www.google.com/search?q=" + a)
    soup = BeautifulSoup(page.content)
    links = soup.findAll("a")
    for link in links:
        if link['href'].startswith('/url?q=') \
                and 'webcache.googleusercontent.com' not in link['href']:
            q = link['href'].split('/url?q=')[1].split('&')[0]
            #self.con.execute("insert into wordlocation(urlid, wordid, location) values(%i, %i, %i)" %(urlid, wordid, i))
            #self.con.execute("insert into googleurllist (keyword,url,relevance,textcomplexity)VALUES('{0}','{1}','{2}','{3}')" .format(a,q,'',''))
            #linkText = self.gettextonly(q)
            #self.con.commit()
            print "Records created successfully"
            print q
            self.addtoindex(q, soup)
            linkText = self.getTtextonly(q)

Error:

File "C:\Users\DELL\Desktop\python\s\fyp\Relevancy\M\pyThinSearch\test.py", in getTtextonly
    html = urllib.urlopen(url).read()
  File "C:\Python27\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "C:\Python27\lib\urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: '\\settings\\ads\\preferences?hl=en'

I am getting nervous, and I really don't understand what the error is asking for.

user3162878

1 Answer

q = link['href'].split('/url?q=')[1].split('&')[0]

q can be a relative URL.

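When the results page for https://www.google.com/search?q=apple contains advertisements, it includes an `a` element whose href attribute starts with '/url?q=/settings/ads/preferences'. After the split, `q` becomes '/settings/ads/preferences?hl=en', which has no scheme or host.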

According to the documentation of urllib.urlopen,

If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines); otherwise it opens a socket to a server somewhere on the network.
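To see why the extracted value is treated as a local path, it may help to parse it. The sketch below is only an illustration: the `href` string is a hypothetical value modeled on the ad-settings link described above, and `has_scheme` is a helper name introduced here, not something from the question. The import fallback is there so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlparse        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urlparse    # Python 3 equivalent

# Hypothetical href, modeled on the ad-settings link described above.
href = '/url?q=/settings/ads/preferences?hl=en&sa=U'

# Same extraction as in the question's findfromGoogle.
q = href.split('/url?q=')[1].split('&')[0]
parts = urlparse(q)


def has_scheme(url):
    """Return True if the URL names a scheme such as http or https."""
    return urlparse(url).scheme != ''


print(q)
print(repr(parts.scheme))
print(has_scheme(q))
```

Because `parts.scheme` is empty for this value, `urllib.urlopen` falls into the local-file branch that appears in the traceback (`open_local_file`).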

You should use urlparse.urljoin to make a URL absolute before passing it to urllib.urlopen.
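A minimal sketch of that suggestion, assuming the search page URL is used as the base. The helper name `extract_target` and the sample inputs are hypothetical; only `urljoin` and the split logic come from the question and answer. The import fallback lets the same code run on Python 3.

```python
try:
    from urlparse import urljoin        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urljoin    # Python 3 equivalent


def extract_target(page_url, href):
    """Return an absolute URL for a '/url?q=' result link, or None."""
    if not href.startswith('/url?q='):
        return None
    # Same extraction as in the question.
    q = href.split('/url?q=')[1].split('&')[0]
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page they were found on.
    return urljoin(page_url, q)


base = 'https://www.google.com/search?q=apple'
relative = extract_target(base, '/url?q=/settings/ads/preferences?hl=en&sa=U')
absolute = extract_target(base, '/url?q=http://www.cplusplus.com/doc/tutorial/program_structure/&sa=U')
print(relative)
print(absolute)
```

In `findfromGoogle`, `page.url` from the `requests` response could serve as `page_url`, so the value passed on to `urllib.urlopen` always carries a scheme.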

pat
  • Thanks. After studying your answer, are you saying the line should end up like this: html = urlparse.urljoin(url).read()? That gives me an error. – user3162878 Apr 04 '16 at 19:01
  • @user3162878 No. I mean you should pass an absolute URL to `urllib.urlopen`. Use urlparse.urljoin to make `q` absolute. – pat Apr 04 '16 at 19:11
  • For example, `q = urlparse.urljoin(page.url, link['href'].split('/url?q=')[1].split('&')[0])` – pat Apr 04 '16 at 19:18
  • @user3162878 Actually, I wonder whether you want to keep that relative URL at all, since you already skip every URL that contains 'webcache.googleusercontent.com'. It seems that you don't want Google URLs. – pat Apr 04 '16 at 19:38
  • thnx. but i got another problem. i am getting in findfromGoogle(self,a): at first place any idea?? – user3162878 Apr 05 '16 at 15:52
  • Is this an error? Which line do you get this? should be the HTTP status code of success. – pat Apr 05 '16 at 16:32
  • Have a look at the link for a picture. That is what I am trying to understand. I never got such a thing before in my whole project. – user3162878 Apr 05 '16 at 16:41
  • http://stackoverflow.com/questions/18810777/reading-the-response-in-python-requests – user3162878 Apr 05 '16 at 16:54
  • It shows the internal structure of `page`. See http://docs.python-requests.org/en/master/api/#requests.Response for details. – pat Apr 05 '16 at 17:00
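If, as pat's comment above suggests, Google's own links are not wanted at all, one option is to filter them out rather than make them absolute. The sketch below is an assumption about what such a filter might look like: the function name `is_external_result` and the `own_host` parameter are hypothetical and do not appear in the question.

```python
try:
    from urlparse import urlparse        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urlparse    # Python 3 equivalent


def is_external_result(q, own_host='google.com'):
    """Return True for absolute http(s) URLs that are not hosted on own_host."""
    parts = urlparse(q)
    # Relative values such as '/settings/ads/preferences?hl=en' have no scheme.
    if parts.scheme not in ('http', 'https'):
        return False
    host = parts.netloc.lower()
    return not (host == own_host or host.endswith('.' + own_host))


print(is_external_result('/settings/ads/preferences?hl=en'))
print(is_external_result('http://www.cplusplus.com/doc/tutorial/program_structure/'))
```

This check does not cover 'webcache.googleusercontent.com', so the existing test for that host in `findfromGoogle` would still be needed alongside it.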