1

I want to classify file types based on their extensions in Python. Before writing it myself, I wanted to check whether there is a Python package that can be used for this purpose. By file type I mean classifying a file as, e.g., doc, ppt, pdf, tar, txt, iso, etc. Ideally it would take the file name as input and return its type. I am running on Linux.

auny
  • A file's extension has nothing to do with its type. – Burhan Khalid Sep 04 '12 at 06:48
  • 3
    Take a look at this question: http://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python . You can *guess* by extension using `mimetypes`, but something like the `python-magic` (mentioned in the second answer) may be more reliable. – kwatford Sep 04 '12 at 06:51
  • Not *nothing* (you hope they're related), but they are definitely not the same thing. E.g., you can change the extension of a `.jpg` to `.doc`, but the type is still JPEG. – Matthew Adams Sep 04 '12 at 06:53
  • I just want to classify based on what the extension says; I'm not concerned with the actual content of the file. Any help now? – auny Sep 04 '12 at 06:57
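
For the extension-only classification asked about here, the standard-library `mimetypes` module mentioned in the comments may already be enough. The following is a minimal sketch, not a definitive solution; the function name `classify_by_extension` is made up for illustration, and the result depends only on the file name, never on the file's contents.

```python
import mimetypes
import os.path


def classify_by_extension(filename):
    """Return (extension, guessed MIME type) based only on the file name.

    Hypothetical helper for illustration; the file is never opened.
    """
    # splitext keeps the leading dot, e.g. ".pdf"; strip it and normalise case.
    _, ext = os.path.splitext(filename)
    ext = ext.lower().lstrip(".")

    # guess_type returns (type, encoding); type is None for unknown extensions.
    mime, _encoding = mimetypes.guess_type(filename)
    return ext or None, mime


if __name__ == "__main__":
    for name in ["report.pdf", "notes.txt", "README"]:
        print(name, classify_by_extension(name))
```

Note that `os.path.splitext` only looks at the last suffix, so a name such as `backup.tar.gz` yields `gz` as the extension, while `mimetypes.guess_type` reports the compression separately as an encoding.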

2 Answers

2

You should look into a document metadata parser. I have used Apache Tika, a Java library, in some of my projects. See the question "Python-based document metadata parser?" for how to use it from Python.

Pratik Mandrekar
1

On Linux you can use the 'file' utility, which determines a file's type. You can call it from your scripts too:

import subprocess
# Prints the detected type to stdout; the return value is the exit status of 'file'.
subprocess.call(['file', 'yourfile'])
Denis
  • 1
    The 'file' command uses the libmagic library. There is a 'python-magic' module that provides a native interface and uses the same logic. – neutrinus Mar 13 '13 at 15:57
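
The `subprocess.call` snippet in the answer above prints the result instead of returning it. If the detected type is needed as a string inside a script, something along these lines might work. It is a rough sketch that assumes the `file` utility is installed and on `PATH`; the function name `detect_type` is hypothetical. Unlike the extension-based approach, this inspects the file's contents.

```python
import subprocess


def detect_type(path):
    """Return the MIME type reported by the 'file' utility for `path`.

    Hypothetical helper; assumes 'file' is installed and on PATH.
    """
    result = subprocess.run(
        # --brief omits the file name from the output,
        # --mime-type prints e.g. "text/plain" instead of a description.
        ["file", "--brief", "--mime-type", path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

Calling `detect_type('yourfile')` returns the type as a plain string that can be compared or logged, rather than only an exit status.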