4

I'm trying to use a Python script to edit a large directory of .html files in a loop, but I'm having trouble looping through the filenames using os.walk(). This chunk of code just reads each html file into a string that I can work with, but the script never enters the loop, as if the files don't exist: it prints point1 but never reaches point2, and it ends without an error message. The files live inside a folder called "amazon", which contains one level of 20 subfolders with 20 html files in each.

Oddly the code works perfectly on a neighboring directory that only contains .txt files, but it seems like it's not grabbing my .html files for some reason. Is there something I don't understand about the structure of the for root, dirs, filenames in os.walk() loop? This is my first time using os.walk, and I've looked at a number of other pages on this site to try to make it work.

import os

rootdir = 'C:\filepath\amazon'
print "point1"
for root, dirs, filenames in os.walk(rootdir):
    print "point2"
    for file in filenames:
        with open(os.path.join(root, file), 'r') as myfile:
            g = myfile.read()
        print g

Any help is much appreciated.

user3087978

4 Answers

6

In Python string literals the backslash is an escape character. Either double each backslash, or use a "raw string" by prefixing the literal with "r".

Example:

>>> 'C:\filepath\amazon'
'C:\x0cilepath\x07mazon'
>>> r'\x'
'\\x'
>>> '\x'
ValueError: invalid \x escape

Explanation: In Python, what does preceding a string literal with “r” mean?
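
To make the difference concrete, here is a minimal sketch (written with Python 3 syntax) comparing the three spellings. The path is just the asker's placeholder, and `describe` is a hypothetical helper introduced only for this illustration.

```python
# Three spellings of the same intended Windows path (a placeholder path,
# not a real directory).
plain = 'C:\filepath\amazon'      # \f and \a are interpreted as escapes
doubled = 'C:\\filepath\\amazon'  # each backslash written twice
raw = r'C:\filepath\amazon'       # raw string: backslashes kept literally


def describe(path):
    """Return (length, has_form_feed, has_bell) for a path string."""
    return len(path), '\x0c' in path, '\x07' in path


print(describe(plain))   # (16, True, True)
print(describe(raw))     # (18, False, False)
print(doubled == raw)    # True
```

The plain literal is two characters shorter because each backslash-plus-letter pair collapsed into a single control character.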

Community
johntellsall
  • 1
    I would include Huu's note about `os.path.join`. Interoperability is one of the best things about Python, so we may as well use the stdlib function included to join filenames together! :) – Adam Smith May 30 '14 at 23:53
2

Your problem is that you're using backslashes in your path:

>>> rootdir = 'C:\filepath\amazon'
>>> rootdir
'C:\x0cilepath\x07mazon'
>>> print(rootdir)
C:
  ilepathmazon

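Because Python string literals use the backslash to escape special characters, the \f in your rootdir represents an ASCII Form Feed character and the \a represents an ASCII Bell character. The resulting path doesn't exist, and os.walk() ignores errors by default, so it yields nothing and the loop body never runs.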
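
The missing error message comes from os.walk() itself: unless an onerror callback is supplied, failures to list a directory are skipped silently. Here is a minimal sketch of making that failure visible, written with Python 3 syntax. `walk_strict` is a hypothetical wrapper name, and the missing directory is created artificially for the demonstration.

```python
import os
import tempfile


def walk_strict(top):
    """Like os.walk, but raise instead of silently skipping unreadable dirs."""
    def fail(error):
        raise error
    return os.walk(top, onerror=fail)


# A path that does not exist, built inside a fresh temp directory.
missing = os.path.join(tempfile.mkdtemp(), 'no_such_dir')

# Default behaviour: errors are ignored, so a loop body would never run.
print(list(os.walk(missing)))   # []

# With onerror set, the failure surfaces as an exception.
try:
    list(walk_strict(missing))
except OSError as exc:
    print('walk failed for:', exc.filename)
```

Checking os.path.isdir(rootdir) before walking is a simpler alternative when you only care about the top-level directory.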

You can either use a raw string (note the r before the apostrophe) to avoid this:

>>> rootdir = r'C:\filepath\amazon'
>>> rootdir
'C:\\filepath\\amazon'
>>> print(rootdir)
C:\filepath\amazon

... or just use regular slashes, which work fine on Windows anyway:

>>> rootdir = 'C:/filepath/amazon'
>>> rootdir
'C:/filepath/amazon'
>>> print(rootdir)
C:/filepath/amazon

As Huu Nguyen points out, it's considered good practice to construct paths using os.path.join() when possible ... that way you avoid the problem altogether:

>>> rootdir = os.path.join('C:\\', 'filepath', 'amazon')  # a bare 'C:' would give 'C:filepath\\amazon'
>>> rootdir
'C:\\filepath\\amazon'
>>> print(rootdir)
C:\filepath\amazon
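
One caveat about joining onto a drive letter: on Windows, a bare 'C:' with no trailing separator means "current directory on drive C", so the joined result is drive-relative, not absolute. The sketch below uses ntpath (the Windows implementation behind os.path) so it can be run on any platform. It is an illustration, not part of the original answer.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

# A bare drive letter is not followed by a separator when joined.
print(ntpath.join('C:', 'filepath', 'amazon'))     # C:filepath\amazon

# Including the root separator in the first component gives an absolute path.
print(ntpath.join('C:\\', 'filepath', 'amazon'))   # C:\filepath\amazon

# Only the second form is an absolute path.
print(ntpath.isabs('C:filepath\\amazon'), ntpath.isabs('C:\\filepath\\amazon'))  # False True
```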
Zero Piraeus
2

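You can avoid most explicit handling of slashes by using os.path.join. On Windows the drive component still needs its trailing separator, because a bare 'C:' produces the drive-relative path 'C:filepath\\amazon':

rootdir = os.path.join('C:\\', 'filepath', 'amazon')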
huu
0

I had an issue that sounds similar to this with os.walk. The escape character (\) added to filepaths on Mac due to spaces in the path was causing the problem.

For example, the path:

/Volumes/MacHD/My Folder/MyFiles/...

when accessed via Terminal is shown as:

/Volumes/MacHD/My\ Folder/MyFiles/...

The solution was to read the path into a string and then create a new string with the escape characters removed, e.g.:

import os

# Ask user for directory tree to scan for master files
masterpathraw = raw_input("Specify directory of master files:")
# Clear escape characters from the path
masterpath = masterpathraw.replace('\\', '')
# Provide this path to os.walk
for fullpath, _, filenames in os.walk(masterpath):
    pass  # Do stuff
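
As an alternative to stripping backslashes by hand, the standard library's shlex module can interpret the pasted text the way a POSIX shell would, which also copes with quoted paths. This is a different technique from the one above, sketched with Python 3 syntax. `unescape_shell_path` is a hypothetical helper name, and the example path is the one from this answer with the elided tail left off.

```python
import shlex


def unescape_shell_path(text):
    """Interpret text as one POSIX-shell word and return the plain path."""
    parts = shlex.split(text)
    if len(parts) != 1:
        raise ValueError('expected exactly one path, got %d' % len(parts))
    return parts[0]


# Backslash-escaped space, as Terminal shows it.
print(unescape_shell_path(r'/Volumes/MacHD/My\ Folder/MyFiles'))
# Quoted path, as produced by some drag-and-drop or copy actions.
print(unescape_shell_path("'/Volumes/MacHD/My Folder/MyFiles'"))
```

Both calls print /Volumes/MacHD/My Folder/MyFiles. Note that an unescaped, unquoted path containing spaces splits into several words, so this helper raises ValueError for it; pass such input to os.walk unchanged.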
Ruggero Turra