
For URLs that expose file trees, such as PyPI package indexes, is there a small, solid module to walk the URL tree and list it like ls -lR?
I gather (correct me if I'm wrong) that there's no standard encoding of file attributes, link types, size, date, etc. in HTML <a> attributes,
so building a solid URL-tree module on shifting sand is tough.
But surely this wheel (Unix file tree -> HTML -> treewalk API -> ls -lR or find) has been reinvented somewhere?
(There seem to be several spiders / web crawlers / scrapers out there, but so far they look ugly and ad hoc, despite BeautifulSoup for parsing.)

SilentGhost
denis

3 Answers


Apache servers are very common, and they have a relatively standard way of listing file directories.

Here's a simple script that does most of what you want; you should be able to adapt it as needed.

Usage: python list_apache_dir.py <url> [<url> ...]

import sys
import urllib
import re

# Note: Python 2 code (urllib.urlopen, print statements, "except ..., e").
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')
          # look for     a link        +    a timestamp    + a size ('-' for dir)
def list_apache_dir(url):
    try:
        html = urllib.urlopen(url).read()
    except IOError, e:
        print 'error fetching %s: %s' % (url, e)
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print url + ' :' 
    print '%4d file' % len(files) + 's' * (len(files) != 1)
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]
        print '%5s  %s  %s' % (size, date, name)

    for dir in dirs:
        print
        list_apache_dir(url + dir)

for url in sys.argv[1:]:
    print
    list_apache_dir(url) 
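The script above is Python 2. A rough Python 3 sketch of the same regex approach (urllib.request in place of urllib; untested against a live server, so treat it as a starting point) would be:

```python
import re
import sys
import urllib.request

# Same pattern as above: a link, an Apache-style timestamp, a size ('-' for dirs).
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:  # URLError is a subclass of OSError
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print(url + ' :')
    print('%4d file%s' % (len(files), 's' * (len(files) != 1)))
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs.append(name)
        print('%5s  %s  %s' % (size, date, name))
    for d in dirs:
        print()
        list_apache_dir(url + d)  # recurse into subdirectories

if __name__ == '__main__':
    for url in sys.argv[1:]:
        print()
        list_apache_dir(url)
```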
rmmh
  • Thank you sysrqb, nice. Where might one have learned this? Also, is there any way of running $(unzip -l remote.zip) on the server, piping to a local file, to list big remote files? – denis Mar 28 '09 at 17:16
  • Pretty please, for anyone reading this after the fact remember [this classic answer to parsing XML/HTML with regex](http://stackoverflow.com/q/1732454). Also the few hundred others. In this particular circumstance the apache directory listing format _shouldn't_ change, but we all know what "shouldn't" means in software (especially UI related)... – Tim Alexander Jan 27 '14 at 15:48
  • That's true, a real parser would be a more resilient solution, but any changes to the listing format would tend to break the scraper -- whether based on simple pattern matching or a proper grammar. – rmmh Jan 27 '14 at 22:32

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

It has CSS selectors as well so this sort of thing is trivial.

aehlke

Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python --

from BeautifulSoup import BeautifulSoup

def trow_cols( trow ):
    """ soup.table( "tr" ) -> <td> strings like
        [None, u'Achoo-1.0-py2.5.egg', u'11-Aug-2008 07:40  ', u'8.9K']
    """
    return [td.next.string for td in trow( "td" )]

def trow_headers( trow ):
    """ soup.table( "tr" ) -> <th> table header strings like
        [None, u'Name', u'Last modified', u'Size', u'Description']
    """
    return [th.next.string for th in trow( "th" )]

if __name__ == "__main__":
    ...
    soup = BeautifulSoup( html )
    if soup.table:
        trows = soup.table( "tr" )
        print "headers:", trow_headers( trows[0] )
        for row in trows[1:]:
            print trow_cols( row )

Compared to sysrqb's one-line regexp above, this is ... longer; who said

"You can parse some of the html all of the time, or all of the html some of the time, but not ..."
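If you'd rather avoid third-party parsers entirely, the same table-row scrape can be sketched with nothing but the standard library's html.parser (a Python 3 rewrite for illustration, not part of the original answer):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text buffer while inside a td/th, else None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            self._cell = ''

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

parser = TableRows()
parser.feed('<table><tr><th>Name</th><th>Size</th></tr>'
            '<tr><td><a href="Achoo-1.0-py2.5.egg">Achoo-1.0-py2.5.egg</a>'
            '</td><td>8.9K</td></tr></table>')
print(parser.rows)  # [['Name', 'Size'], ['Achoo-1.0-py2.5.egg', '8.9K']]
```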

denis