I want to parse a table from a .docx file using Python and python-docx into some useful data structure.

The .docx file contains only a single table in my case. I've uploaded it so you can have a look. Here's a screenshot:

Books.docx

Anto
Sreedhar
  • Post code and relevant materials here, not on some 3rd party site, and especially not with some shortened URL to some unknown website. – Cory Kramer Jan 09 '15 at 13:33
  • I have tried a lot of ways to parse it but didn't get anything to work, so I didn't paste any code. I don't think it is useful if the code is not working – Sreedhar Jan 09 '15 at 13:35
  • @Cyber I have attached docx file in that link - nothing other than this – Sreedhar Jan 09 '15 at 13:36

1 Answer

You can use the snippet below to parse your document into a list where each row is a dictionary mapping the table header value to the column value.

from docx.api import Document

# Load the first table from your document. In your example file,
# there is only one table, so I just grab the first one.
document = Document('Books.docx')
table = document.tables[0]

# Data will be a list of rows represented as dictionaries
# containing each row's data.
data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # Establish the mapping based on the first row
    # headers; these will become the keys of our dictionary
    if i == 0:
        keys = tuple(text)
        continue

    # Construct a dictionary for this row, mapping
    # keys to values for this row
    row_data = dict(zip(keys, text))
    data.append(row_data)

This will give you:

data = [
  {u'Pub.': u'Penguin Books',
   u'Auther': u'Edward de BONO',
   u'Sr. No.': u'1',
   u'Name of Book': u'Six Thinking Hats'
  },
  ...
]
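
As a rough sketch, the same header-to-value mapping can be split into small helpers so the dictionary logic can be tried without a .docx file at hand. The function names (table_to_rows, rows_to_dicts, read_first_table) are made up for illustration and are not part of python-docx; only Document, tables, rows, cells and cell.text come from the library.

```python
def table_to_rows(table):
    """Yield each row of a python-docx table as a list of cell strings."""
    for row in table.rows:
        yield [cell.text for cell in row.cells]


def rows_to_dicts(rows):
    """Turn rows of strings into dicts keyed by the first (header) row."""
    rows = iter(rows)
    try:
        keys = next(rows)  # the first row supplies the dictionary keys
    except StopIteration:
        return []  # empty table: no header and no data
    return [dict(zip(keys, row)) for row in rows]


def read_first_table(path):
    """Hypothetical convenience wrapper: parse the first table in a file."""
    # Imported here so the helpers above also work without python-docx.
    from docx import Document
    return rows_to_dicts(table_to_rows(Document(path).tables[0]))


# Plain lists standing in for the table cells from the example above.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
records = rows_to_dicts(sample)
print(records[0]["Name of Book"])
```

With the real file, read_first_table('Books.docx') should produce the same kind of list, assuming the first row of the table holds the headers.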

If you just want a tuple for each row, set row_data to tuple(text) instead of building a dictionary. In the loop, replace the dict construction with:

# Construct a tuple for this row
row_data = tuple(text)
data.append(row_data)

Now, data would hold something like this instead:

data = [
  (u'1',
   u'Six Thinking Hats',
   u'Edward de BONO',
   u'Penguin Books'
  ),
 ...
]

In that case you no longer need to build keys, but you should still skip the first (header) row.
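
Put together, the tuple variant might look like the sketch below. The helper name rows_to_tuples and its skip_header parameter are assumptions made for illustration; the input is any iterable of rows of cell text, for example one built with [cell.text for cell in row.cells] for each row in table.rows.

```python
def rows_to_tuples(rows, skip_header=True):
    """Convert rows of cell text into tuples, optionally dropping the header."""
    rows = iter(rows)
    if skip_header:
        next(rows, None)  # discard the first row; does nothing if there are no rows
    return [tuple(row) for row in rows]


# Plain lists standing in for the table cells from the example above.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
data = rows_to_tuples(sample)
print(data[0])
```

Tuples keep the column order of the table, so a value is looked up by position (data[0][1] for the book title) rather than by header name.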

vicvicvic
  • In addition to this, if `docx.api` throws an error, you can use `from docx import Document` directly, provided the library you installed is `python-docx` rather than `docx`. python-docx is compatible with both Python 2.x and 3.x. – Aseem Yadav Nov 13 '18 at 09:49