0

I have a good number of files in a folder and I want to read all of them. They are in different formats and different encodings. Using listdir/glob.glob I am able to get the list of files, but how do I open, read, or process them when their encodings differ?

Any help would be appreciated. I am using Python 3.2 on Windows.

Regards, Subhabrata Banerjee.

Ashwini Chaudhary
SUBHABRATA

3 Answers

2

Assuming you know which files are in which encodings, use codecs.open(). It works almost exactly like regular open(), but takes an optional encoding parameter.
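As a minimal sketch of the known-encoding case: on Python 3 the built-in open() accepts the same encoding argument, so the example below uses it directly. The helper name read_text and the throwaway demo file are made up for illustration.

```python
import os
import tempfile


def read_text(path, encoding):
    """Return the decoded contents of `path`, assuming it is stored in `encoding`."""
    # The encoding argument tells Python how to turn the file's bytes into text.
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Demo: write a small Latin-1 file, then read it back with the matching encoding.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sample_latin1.txt")
with open(demo_path, "w", encoding="latin-1") as handle:
    handle.write("caf\u00e9")  # "cafe" with an acute accent on the e

print(read_text(demo_path, "latin-1"))
```

Reading the same file with the wrong encoding (for example "utf-8") would raise UnicodeDecodeError here, which is why the encoding has to match the file.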

If you don't know which files are in which encodings, then it's more difficult. You can try something like chardet or the other answers to this question.
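If installing chardet is not an option, one rough alternative is to try a short list of candidate encodings in order. This is only a sketch and it is not what chardet does: the function name decode_with_fallback and the candidate list are assumptions, and a decode that succeeds is not proof that the guess was right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Decode `raw` bytes with the first candidate encoding that does not fail.

    Returns a (text, encoding_used) tuple.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess cannot be right, try the next one
    # latin-1 maps every possible byte to a character, so it never fails;
    # the result may still be the wrong text.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback("caf\u00e9".encode("cp1252"))
print(used)
```

Putting strict encodings such as UTF-8 first matters, because permissive single-byte encodings accept almost any input and would hide the real answer.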

Jeremiah
    and by 'more difficult' you mean 'impossible' in the general case. `open()` on Python 3 supports an `encoding` parameter – jfs Jul 10 '12 at 18:05
1

open(fp) is the standard way of opening a file in Python; see http://docs.python.org/library/functions.html#open

Once the file is open, you can read it with .read(), as described at http://docs.python.org/library/stdtypes.html#bltin-file-objects

The encodings are going to be trickier, and the approach depends on how you know which encoding each file uses.
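One way to tie this together, assuming you can write down the encoding of each file yourself, is a dictionary from file name to encoding. The sketch below is only an illustration: read_folder, the .txt filter, and the sample file names are all made up for the demo.

```python
import os
import tempfile


def read_folder(folder, encodings, default="utf-8"):
    """Read every .txt file in `folder`.

    `encodings` maps file names to encoding names; files that are not
    listed are read with `default`.
    """
    texts = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        with open(path, "r", encoding=encodings.get(name, default)) as handle:
            texts[name] = handle.read()
    return texts


# Demo: two files with different encodings in a temporary folder.
demo_dir = tempfile.mkdtemp()
samples = {
    "a_utf8.txt": ("utf-8", "gr\u00fc\u00df"),
    "b_latin1.txt": ("latin-1", "caf\u00e9"),
}
for name, (encoding, content) in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding=encoding) as handle:
        handle.write(content)

known = dict((name, encoding) for name, (encoding, _) in samples.items())
texts = read_folder(demo_dir, known)
print(sorted(texts))
```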

Andrew Cox
0

As Jeremiah wrote, codecs.open() does for Python 2 what the modernized open() does for Python 3. The encoding argument says which encoding is used inside the file.

However, the important difference is that with codecs.open(), the lines you read are unicode strings (and the lines you write are expected to be unicode strings), not plain old strings (i.e. sequences of bytes). This feels more natural in Python 3, but it can also be done in Python 2 this way.
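The bytes versus text distinction can be seen on Python 3 by reading the same file twice, once in binary mode and once with an encoding. This is a small sketch using the built-in open() rather than codecs.open(); the demo file is hypothetical.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as handle:
    # "naive" with a diaeresis on the i: five characters, six bytes in UTF-8.
    handle.write("na\u00efve".encode("utf-8"))

with open(demo_path, "rb") as handle:
    raw = handle.read()   # binary mode: a bytes object, no decoding done

with open(demo_path, "r", encoding="utf-8") as handle:
    text = handle.read()  # text mode: a str of decoded characters

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The lengths differ because the accented character takes two bytes in UTF-8 but counts as one character once decoded.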

I recommend reading Mark Pilgrim's Dive Into Python 3, Chapter 4, "Strings".

His Chapter 15, "Case Study: Porting chardet to Python 3", explains how the chardet module mentioned above works.

pepr