In Python's requests, if I walk through the URLs in the response's history, it gives me the redirect URLs, as follows:

import requests
response = requests.get('https://yahoo.com')
for resp in response.history:
    print(resp.url, resp.text)

Q: Does anyone know where the redirect URLs are taken from? The headers? If a response has no Location header but still causes a redirect, how does the library identify the redirect URL? Can you provide references, please?

EDIT:

I looked at the documentation. It does not say how. Some answers suggest that it is headers['Location'], but I am not sure. Are the redirect URLs that I extract from the response history, item by item, just the 'Location' header of each response? Or is there anything else that the library uses to identify the redirect URLs? Maybe a Python expert can help?

user9371654

3 Answers

0

HTTP redirects generally take the form of a 3xx response code plus a "Location:" header which indicates where to redirect to. This is codified in the HTTP protocol, and so any conformant client implementation will simply do whatever that spec says.

See RFC 7231 Section 6.4.
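To check this against a real response, a small helper can list, for each entry in the history, the status code, the URL that was requested, the raw Location header, and the URL that header resolves to. This is only a sketch: redirect_chain is a name made up here, and it assumes the object passed in behaves like a requests response (a .history list whose items have .status_code, .url and .headers).

```python
from urllib.parse import urljoin


def redirect_chain(response):
    """Return one (status_code, url, location, target) tuple per redirect hop.

    `response` is assumed to look like a requests response: it needs a
    `.history` list whose items expose `.status_code`, `.url` and `.headers`.
    """
    chain = []
    for hop in response.history:
        location = hop.headers.get("Location")
        # A Location value may be relative, so resolve it against the URL
        # of the response that carried it.
        target = urljoin(hop.url, location) if location else None
        chain.append((hop.status_code, hop.url, location, target))
    return chain
```

Calling redirect_chain(requests.get('https://yahoo.com')) and printing the result should show whether each hop's resolved target matches the URL of the next response in the chain.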

In so many words, if you call requests to visit a URL (with redirection allowed; it can be turned off by passing allow_redirects=False) and the server says "go here instead", requests will internally call itself on the new URL and add the previous response to the history. It repeats this as many times as it takes to reach a page which does not redirect, or until you hit the limit (30 by default in requests, to prevent shenanigans such as a page redirecting to itself in an endless loop).
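The loop described above can be sketched in a few lines. This is not the library's actual code, just a simplified model of it: follow_redirects and the fetch callable are hypothetical names, and the sketch assumes that a response counts as a redirect only when it has one of the usual redirect status codes and a Location header.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption, based on
# the common HTTP redirect codes).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_redirects=30):
    """Follow Location headers starting from `url`.

    `fetch` is a hypothetical callable: fetch(url) -> (status_code, headers),
    where headers is a dict. Returns (final_url, history), where history is
    the list of URLs that answered with a redirect.
    """
    history = []
    while True:
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            # Not a redirect, or nowhere to go: this is the final response.
            return url, history
        if len(history) >= max_redirects:
            raise RuntimeError(f"exceeded {max_redirects} redirects")
        history.append(url)
        # Relative Location values are resolved against the current URL.
        url = urljoin(url, location)
```

Under these assumptions, a 3xx response without a Location header is simply treated as the final response, because the client has nowhere else to go.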

Many web applications, such as CMSes, rely on server-side URL rewriting configurations. These let a programmer generate a (structurally) simple URL, which the server then resolves and redirects to a different location that may be friendlier to the human eye or conform to a unified convention defined by that server's administrator. Some content delivery networks use redirection to send each visitor to a server which is close to them geographically or in terms of network topology. Click tracking also frequently causes your browser to jump via a unique URL before it goes on to fetch the content it is actually trying to display. Because of these techniques, it's not uncommon to see multiple redirects when you attempt to fetch something.

In addition, but really outside of what requests or similar libraries support, interactive browsers also commonly support JavaScript, which allows a web page to run code in the browser which may cause it to visit a new page under programmatic control (i.e. perhaps under complex conditions which might not even be entirely deterministic). If you need to support this, the currently popular solution is to run a real interactive browser (perhaps "headless", i.e. with no observable user interface) and have it communicate its state to Python somehow.

tripleee
  • Thanks. Your answer describes many ways in which redirection can happen. My question is: which of these methods does requests take into account when it performs redirection? Does it consider only the Location header plus the response code, or other server-side methods as well? – user9371654 Apr 13 '19 at 09:10
  • I'm sorry, I don't really see how I could make this clearer. There is only one server-side redirection technique and that is the 3xx HTTP redirect that you describe. The fourth paragraph explores some common scenarios in which servers might do this. – tripleee Apr 13 '19 at 10:46
0

I guess you misunderstand how redirection works.

Redirection is a client-side action: the server only tells the client where to go, and if the client does not follow that instruction, no redirect happens. So it is actually requests that does the redirection for you, and it is no surprise that it can keep track of the history.

Say you send a request to a.com and the response redirects to b.com. Then requests will send another request to b.com and add the a.com response to the history.

If the response from b.com in turn redirects to, let's say, c.com, then requests will do the same thing: send another request to c.com and add the b.com response to the history.

Here is the related method, resolve_redirects. It is a generator, and I believe it is not hard to follow.

Sraw