I am building a web crawler using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
mime_type = response.getheader("content-type")
I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.
Example of a problematic URL:
Right now, my logic for determining a file's MIME type is to take it from the server's response and, if that is not available, to fall back on the file's extension.
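To make that logic concrete, here is a minimal sketch of it. The helper name `resolve_mime_type` is hypothetical (it is not part of urllib3), and I'm assuming the standard-library `mimetypes` module for the extension fallback. It also shows the problem: when the server declares text/plain, the extension is never consulted.

```python
import mimetypes
from urllib.parse import urlparse


def resolve_mime_type(content_type_header, url):
    """Return the server-declared MIME type, else guess from the URL's extension.

    Hypothetical helper for illustration only.
    """
    if content_type_header:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        declared = content_type_header.split(";", 1)[0].strip().lower()
        if declared:
            return declared
    # Fall back to the extension in the URL path (the query string is ignored).
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    return guessed  # None when the extension is missing or unknown
```

With the snippet above, this would be called as `resolve_mime_type(response.getheader("content-type"), url)`.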
How come Firefox is not confused by these kinds of URLs and lets the user download the file right away? How does it know that the file is not plain text? How can I get the correct MIME type?