I am building a web crawler using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
# url is the link currently being crawled
response = pool.request("GET", url)
# Content-Type exactly as reported by the server
mime_type = response.headers.get("content-type")

I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.

Example of a problematic URL:

http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx

Right now my logic for determining a file's MIME type is to take it from the server's response and, if that is not available, to fall back on the file's extension.

How come Firefox is not confused by these kinds of URLs and lets the user download the file right away? How does it know that this file is not plain text? How can I get the correct MIME type?

Montoya

2 Answers

I haven't read the Firefox source code, but I would guess that Firefox either tries to guess the filetype based on the URL, or refuses to render it inline if it's a specific Content-Type and larger than some maximum size, or perhaps it even inspects some of the file contents to figure out what it is based on a magic number at the start.
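To illustrate the magic-number idea, here is a minimal sketch that compares the first bytes of a body against a small signature table. The table and the function name `sniff_magic` are made up for this example, and this is not a claim about how Firefox does it. Note that docx and epub files are ZIP containers, so a check like this can only tell you "this is a ZIP archive, not plain text".

```python
# Hypothetical signature table: leading bytes -> coarse MIME type.
# These are well-known file signatures, but the list is far from complete.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also the start of docx, epub, odt, ...
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff_magic(head):
    """Return a coarse MIME type for the first bytes of a body, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

You would pass it the first few bytes of the response body, for example `sniff_magic(response.data[:8])`.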

You can use the Python mimetypes module in the standard library to guess what the filetype is based on the URL:

import mimetypes

url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"
# guess_type returns a (type, encoding) pair based on the file extension
mime_type, encoding = mimetypes.guess_type(url)

In this case, mime_type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document" which is probably what you want.
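One way to combine this with the server's header is to trust the header unless it is a generic catch-all value, and otherwise prefer the guess from the URL. The sketch below is only a suggestion: `pick_mime_type` and `GENERIC_TYPES` are names invented for this example, and which values count as "generic" is an assumption you may want to tune. What `mimetypes` recognizes also depends on the MIME tables available on your system.

```python
import mimetypes
from urllib.parse import urlsplit

# Assumption: header values that servers often send as a default.
GENERIC_TYPES = {"text/plain", "application/octet-stream"}


def pick_mime_type(url, header_value):
    """Prefer a specific Content-Type header, else guess from the URL path."""
    declared = None
    if header_value:
        # Drop parameters such as "; charset=utf-8".
        declared = header_value.split(";", 1)[0].strip().lower()

    # Guess from the path only, so a query string cannot hide the extension.
    guessed, _encoding = mimetypes.guess_type(urlsplit(url).path)

    if declared and declared not in GENERIC_TYPES:
        return declared
    # Fall back to the guess, or to the generic header if there is no guess.
    return guessed or declared
```

With the crawler code from the question, this would be called as `pick_mime_type(url, response.headers.get("content-type"))`.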

shazow

Unfortunately, text/plain is the right MIME type for your response, as stated here.

For text documents without specific subtype, text/plain should be used.

I tested your URL in Chrome and saw the same behaviour you described for Firefox: Chrome downloaded the file instead of opening it, even though the Content-Type header was text/plain.

This means that those browsers use more than just this header to decide whether to download or open a file, which might include whether they are able to parse that file themselves.

That said, you can't rely on the Content-Type header if you want to determine the real MIME type of whatever comes back in the response. One alternative is to temporarily store the response body and determine its MIME type afterwards.
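As a rough sketch of that idea, the body can be kept in memory and inspected with the standard library. Both docx and epub are ZIP archives, so the example below looks for entries these formats conventionally contain: a `mimetype` entry holding `application/epub+zip` for EPUB, and `word/document.xml` for docx. The function name `detect_from_body` is made up for this example, and the checks are heuristics, not a full format validation.

```python
import io
import zipfile

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
EPUB = "application/epub+zip"


def detect_from_body(body):
    """Guess a MIME type from downloaded bytes; None if it is not a ZIP file."""
    buffer = io.BytesIO(body)
    if not zipfile.is_zipfile(buffer):
        return None

    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
        # EPUB containers carry a "mimetype" entry with the media type in it.
        if "mimetype" in names:
            declared = archive.read("mimetype").decode("ascii", "replace").strip()
            if declared == EPUB:
                return EPUB
        # Heuristic: Word documents normally keep their main part here.
        if "word/document.xml" in names:
            return DOCX

    # Some other kind of ZIP archive.
    return "application/zip"
```

With urllib3 you could call `detect_from_body(response.data)` and only fall back to the header or the extension when it returns None.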

lucasnadalutti