I'm trying to scrape the web with Python at my new job, using the same method I used at my previous two jobs, except it's not working now. Here's the code:

import urllib  # Python 2 module; provides urllib.urlopen
url = 'http://www.google.com'
html = urllib.urlopen(url).read()

And this is the error:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    html = urllib.urlopen(url).read()
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\urllib.py", line 350, in open_http
    h.endheaders(data)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\httplib.py", line 1053, in endheaders
    self._send_output(message_body)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\httplib.py", line 897, in _send_output
    self.send(msg)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\httplib.py", line 859, in send
    self.connect()
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\httplib.py", line 836, in connect
    self.timeout, self.source_address)
  File "C:\Users\NREARDO2\AppData\Local\Continuum\Anaconda2\lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

Is this because I'm working at a conglomerate and a security protocol is preventing me from doing this, or is there a way to get around it?

ivan_pozdeev
  • See the thread linked here; you need to pass the proxy attribute: https://stackoverflow.com/questions/7334199/getaddrinfo-failed-what-does-that-mean/48788583#48788583 – Vinay Feb 16 '18 at 11:04

1 Answer

According to Windows Sockets Error Codes - MSDN, error 11004 means:

WSANO_DATA 11004

Valid name, no data record of requested type.

The requested name is valid and was found in the database, but it does not have the correct associated data being resolved for. The usual example for this is a host name-to-address translation attempt (using gethostbyname or WSAAsyncGetHostByName) which uses the DNS (Domain Name Server). An MX record is returned but no A record—indicating the host itself exists, but is not directly reachable.

In human terms, this means that your host name (as extracted from the URL) is valid format-wise but cannot be resolved to a valid IP.
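To see which name is actually being looked up, and whether it resolves, the same lookup that fails in the traceback can be run in isolation. This is a minimal sketch, not part of the original answer: `host_from_url` and `resolve` are hypothetical helper names, and the import fallback is there only so the snippet runs on both Python 2 (as in the question) and Python 3.

```python
import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in the question


def host_from_url(url):
    """Return the host name part of a URL, e.g. 'www.google.com'."""
    return urlparse(url).hostname


def resolve(host, port=80):
    """Return the sorted IP addresses `host` resolves to, or [] if the lookup fails.

    This makes the same socket.getaddrinfo call that appears at the
    bottom of the traceback, but reports a failure instead of raising.
    """
    try:
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        print("Cannot resolve %r: %s" % (host, exc))
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted(set(res[4][0] for res in results))
```

Calling `resolve(host_from_url('http://www.google.com'))` on the affected machine should print the same `getaddrinfo failed` message if name resolution is the problem, and return a list of addresses if it is not.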

In other questions (1,2,3,4,5), people report having this problem if they:

  • have the name mapped to an invalid IP (like 0.0.0.0) in their hosts file
  • format URL incorrectly (like forgetting the third slash in file:/// or mistyping an IP)
  • use an unresolvable DNS name
  • have the http_proxy environment variable or registry proxy settings pointing to a nonexistent host (other problems with the proxy or its settings would result in a different error, not this one)

In your case, the second cause (and likely the fourth, too) is out of the question, so check the others: whether you can resolve the name with nslookup, and whether the name has an entry in your hosts file.
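The hosts-file and proxy checks can also be done from Python. The sketch below makes some assumptions: `hosts_entries` and `read_hosts_file` are hypothetical helpers, the Windows hosts path is the default location (your machine may differ), and `getproxies()` is the standard-library function that reports the proxy settings urllib would pick up from the environment (and, on Windows, from the registry).

```python
import os

try:
    from urllib.request import getproxies  # Python 3
except ImportError:
    from urllib import getproxies          # Python 2, as used in the question


def hosts_entries(name, hosts_text):
    """Return the IPs that the given hosts-file text maps `name` to."""
    ips = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()  # first field is the IP, the rest are names
        if name.lower() in (alias.lower() for alias in fields[1:]):
            ips.append(fields[0])
    return ips


def read_hosts_file():
    """Read the hosts file from its default location (an assumption)."""
    if os.name == "nt":
        root = os.environ.get("SystemRoot", r"C:\Windows")
        path = os.path.join(root, "System32", "drivers", "etc", "hosts")
    else:
        path = "/etc/hosts"
    try:
        with open(path) as f:
            return f.read()
    except IOError:
        return ""


if __name__ == "__main__":
    # An entry such as 0.0.0.0 here would explain a failed lookup.
    print("hosts entries: %s" % hosts_entries("www.google.com", read_hosts_file()))
    # A proxy pointing at a nonexistent host would also cause this error.
    print("proxy settings: %s" % getproxies())
```

An empty hosts result together with empty proxy settings points toward the DNS server itself, which is what running `nslookup www.google.com` from cmd would confirm.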

ivan_pozdeev
  • I'm not following what you're saying here. Can you provide me with some code or a procedure to determine how to work around this? – Nick Reardon Nov 22 '16 at 14:05
  • He is telling you: check your nameserver. It looks like you are hitting a configuration issue on your box. You can test this, e.g., by running nslookup for google.com from inside cmd – frlan Nov 22 '16 at 14:14
  • @NickReardon I added some how-to links; read up on those things for more info. The gist is: the problem is external to your code and can very well result from your machine and/or corporate network configuration. You can diagnose it as suggested to find out what specifically is wrong and what you can do about it. – ivan_pozdeev Nov 22 '16 at 14:18