I would like to extract the IP address and port number of each proxy listed on this page: http://spys.one/free-proxy-list/FR/. Here is my Python code:

import urllib.request
import re

url = 'http://spys.one/free-proxy-list/FR/'

req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode('utf-8')

ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', html)

# ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+[0-9]', html)  # This does not work either

print(ip)

Output - ['37.59.0.139', '212.47.239.185', '85.248.227.165', '167.114.250.199', '51.15.86.160', '212.83.164.85', '82.224.48.173']

I get only the IP addresses, not the port numbers.

I'm expecting something like this - '37.59.0.139:17658'

dini
  • `r'\d{1,3}(?:\.\d{1,3}){3}:\d+'` works for me. I replaced the `[0-9]` blocks with the equivalent class `\d`. But take a look at the returned text: the string you are searching for is not returned in that format. Instead, the site puts additional HTML elements between the IP and the port. – Dave J Sep 29 '17 at 11:53
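
To illustrate the comment above, here is a minimal sketch. Both sample strings are invented for demonstration and are not copied from the site; the second one only imitates markup sitting between the address and the port.

```python
import re

# Pattern from the comment: dotted quad, a colon, then the port digits.
PATTERN = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}:\d+')

# Hypothetical inputs, invented for this example.
plain_text = "proxy 37.59.0.139:17658 is up"
with_markup = "37.59.0.139<script>...</script><font>:</font>17658<"

plain_matches = PATTERN.findall(plain_text)
markup_matches = PATTERN.findall(with_markup)

print(plain_matches)   # the ip:port pair is found
print(markup_matches)  # empty: tags separate the IP from the colon
```

The pattern itself is fine for plain `ip:port` text; it fails on the page only because the two parts are not adjacent in the HTML.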

2 Answers

First, a note on the (?: in your regex: it opens a non-capturing group, so it is valid as written. It is not the same as (:?, which opens a capturing group that begins with an optional colon (zero or one :).

Your regex only looks for four groups of digits separated by periods. To include the port you need five groups of digits: 0.0.0.0:0000 = five groups. Try this instead:

re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{1,5})?', html)
  • (?: ... ) = a non-capturing group, so re.findall returns the whole match rather than the contents of each group
  • [0-9]{1,3} = between one and three digits
  • \. = a period (escaped, because . means "any character")
  • {3} = the preceding group must be repeated exactly three times
  • (?::[0-9]{1,5}) = a colon followed by one to five digits (port numbers go up to 65535). This is your port.
  • ? = the port is optional: it will either be there or it won't.
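
A minimal sketch of how re.findall treats groups, which matters for a pattern like the one above. The sample text is invented, and the port quantifier is widened to {1,5} as an assumption so that five-digit ports such as 17658 fit.

```python
import re

# Hypothetical sample text, invented for this example.
text = "37.59.0.139:17658 and 212.47.239.185"

# Capturing groups: findall returns one tuple of group contents per match,
# not the full matched text.
capturing = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{1,5})?', text)

# Non-capturing groups: findall returns the whole matched text.
non_capturing = re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{1,5})?', text)

print(capturing)
print(non_capturing)
```

With capturing groups the result holds only the last repetition of the first group and the optional port, so the non-capturing form is the one to use when the goal is the full ip:port string.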
cwallenpoole

Your code does not work because, aside from the regex issues pointed out in other answers, the website you provided fills in the port number of each IP by executing JavaScript embedded in the page's HTML.

In order to capture each IP and its associated port number, you first need to execute the JavaScript so that the port numbers are actually present in the HTML response (you can follow the guidelines here: Web-scraping JavaScript page with Python). Then you need to extract this information from the JavaScript-rendered HTML response.

By inspecting the HTML response, I found that each port number is preceded by :</font> and followed by <.

A working code snippet can be found below. I took the liberty of slightly modifying your IP regex, because only certain IP addresses are associated with a port number (the other IPs belong to the hostname column and should be discarded). The IPs of interest are those followed by the <script string.

import dryscrape
import re

url = 'http://spys.one/free-proxy-list/FR/'

# Get the HTML after the page's JavaScript has run
session = dryscrape.Session()
session.visit(url)
response = session.body()

# Capture the IPs: only those immediately followed by "<script"
IP = re.findall(r'[0-9]+(?:\.[0-9]+){3}(?=<script)', response)

# Capture the ports: the text between ":</font>" and the next "<"
port = re.findall(r'(?<=:</font>)(.*?)(?=<)', response)

# Join each IP with its port
IP_with_ports = [ip + ":" + p for ip, p in zip(IP, port)]

print(IP_with_ports)

OUTPUT: ['178.32.213.128:80', '151.80.207.148:80', '134.119.223.242:80', '37.59.0.139:17459', ..., '37.59.0.139:17658']

Do note that the code above only works for the website you provided, as each website has its own logic for displaying data.
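
If dryscrape is not available, the two regexes can still be tried in isolation. The sketch below runs them against a hypothetical HTML fragment that I made up to imitate the structure described above (IP followed by <script, port between :</font> and <); it is not the site's real markup. As a small assumption, the port pattern is tightened from (.*?) to [0-9]+, and a length check is added before pairing.

```python
import re

# Hypothetical fragment, invented to imitate the structure described above.
# The middle cell holds an IP that is NOT followed by "<script" and must be ignored.
html = (
    '<td>178.32.213.128<script>/* ... */</script><font>:</font>80</td>'
    '<td>host-10.0.0.1.example</td>'
    '<td>37.59.0.139<script>/* ... */</script><font>:</font>17658</td>'
)

# IPs of interest: dotted quads immediately followed by "<script".
ips = re.findall(r'[0-9]+(?:\.[0-9]+){3}(?=<script)', html)

# Ports: digits preceded by ":</font>" and followed by "<".
ports = re.findall(r'(?<=:</font>)[0-9]+(?=<)', html)

# Pairing by position only makes sense if both lists line up.
if len(ips) != len(ports):
    raise ValueError("IP and port counts differ; the page layout may have changed")

ip_with_ports = [f"{ip}:{port}" for ip, port in zip(ips, ports)]
print(ip_with_ports)
```

The length check is there because the IPs and ports are collected by two independent searches; if the markup changes, a silent mismatch would pair addresses with the wrong ports.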

P. Shark
  • Thanks P. Shark. I get an error while installing the dryscrape module. I followed this document to install it on my Ubuntu OS, but without success: https://media.readthedocs.org/pdf/dryscrape/latest/dryscrape.pdf PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/xvfbwrapper.py' – dini Sep 29 '17 at 16:22
  • Take a look at the first answer in the link I provided: you will find a link to the *dryscrape* GitHub page, which provides instructions on how to properly install it in an Ubuntu environment. Just make sure you execute the commands with *sudo* ;) – P. Shark Sep 29 '17 at 16:27
  • http://www.gatherproxy.com/proxylist/country/?c=Italy @shark - I tried the regex on this site, but I couldn't get it to work. Can you please give me the regex to find the IP and port? – dini Oct 06 '17 at 01:28
  • r'[0-9]+(?:\.[0-9]+){3}:[0-9]+[0-9]?' This worked. – dini Oct 06 '17 at 03:11