
Good Morning All,

I've been trying to access an HTTPS website through Python 2.7, but I have not been able to retrieve its content, and days of research have not helped. The website is: https://www.cioh.org.co/. In Python, I'd like to be able to access the page and retrieve all the HTML content. In the past, I'd use the ssl module and add the following lines at the top:

import ssl

ssl._create_default_https_context = ssl._create_unverified_context

This time that doesn't work, and when I use the requests module with requests.get('https://www.cioh.org.co/') I get the error: SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)

Answers on other sites suggested disabling verification entirely:

import requests

r = requests.get(URL, verify=False)
print r.text

I've tried that as well, but it doesn't actually scrape the content. It simply retrieves the internal header information from the website, like so:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>

The printed response is nothing like the website. Through countless hours of research, I've tried using the certifi module. I also installed OpenSSL, extracted .crt, .key, and .pem files (and tried using them), and still no luck. I can expand on the research I've done if need be.
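For reference, the certifi attempt looked roughly like this (a sketch of what I tried, assuming requests and certifi are installed; it gave me no luck either):

import certifi
import requests

# Point requests at certifi's CA bundle explicitly
# (this is effectively what requests does by default)
r = requests.get('https://www.cioh.org.co/', verify=certifi.where())
print r.text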

The website can be accessed easily from any browser. Any help would be greatly appreciated.

Side Note: This is my first time creating an account and asking a question. If I wasn't clear with anything, please let me know. Thanks in advance.

Darican
  • That's not an "internal header", that's in fact the entire HTML document. Your browser then executes the JavaScript code behind the link; what you see rendered in the browser is the result of that. This is a FAQ. – tripleee Jun 17 '18 at 15:07

2 Answers


Judging by the _Incapsula_Resource script in the response, your request is being blocked by a WAF (web application firewall), in this case Imperva's Incapsula.

You can try changing the user-agent string in the requests.get call to look more like a regular browser's, but the owner of the site clearly doesn't want automated scripts scraping their pages.
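For example, something along these lines (a rough sketch; the User-Agent string is just an illustrative browser value, and the WAF may still serve the challenge page anyway):

import requests

# Example desktop-browser User-Agent string; any current browser's value works the same way
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/67.0.3396.99 Safari/537.36')
}
r = requests.get('https://www.cioh.org.co/', headers=headers)
print(r.status_code)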

Andrew Morozko

Well, apparently your code has to somehow mimic a browser, so I think you can do it this way:

from selenium import webdriver


def scrape_page(url):
    # Drive a real Firefox instance so the Incapsula JavaScript challenge actually runs
    browser = webdriver.Firefox()
    browser.get(url)
    # page_source returns the DOM after the page's JavaScript has executed
    content = browser.page_source
    # quit() closes the window and also shuts down the geckodriver process
    browser.quit()
    return content


if __name__ == "__main__":
    print(scrape_page('https://www.cioh.org.co/'))

The implementation is pretty clumsy but it works and I hope you get the idea.

To get it going you will have to install geckodriver and make sure it's on your PATH; the geckodriver project page has installation instructions. In order to install selenium just type: pip3 install selenium
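If putting geckodriver on your PATH isn't an option, you can also point Selenium at the binary directly (the path below is just a placeholder for wherever geckodriver actually lives on your machine):

from selenium import webdriver

# Placeholder path; substitute the real location of the geckodriver binary
browser = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')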

GoBear
    I'm actually familiar with using selenium and webdriver for a few things. Got it to work utilizing the code you provided (after of course referencing where the webdriver was located). Unfortunately at where I'll be using the code, the webdriver .exe files do not cooperate with the policies, but, I was able to accomplish what I wanted using the command prompt (then using the subprocess module in python) to scrape the webpage. Thanks for your HELP!! – Darican Jun 17 '18 at 22:42
  • @Darican Always glad to help! – GoBear Jun 18 '18 at 08:51