I am currently building a Python program to scrape and parse PDFs, hopefully a bit more elegantly than the tools currently available.

The hierarchy of the data that pdfquery outputs in Python is as follows (hopefully this makes sense):

PDFDocument
    PDFPage[1]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    PDFPage[2]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    ...
    PDFPage[i]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]

I would like to create an object-oriented Python data structure that mirrors the hierarchy above, with pdfElement objects embedded as attributes of pdfPage objects, which are in turn embedded as attributes of the pdfDocument object.
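As a minimal sketch of what I mean, the nesting could look something like this with dataclasses. The class names (`PdfElement`, `PdfPage`, `PdfDocument`), their fields, and the `build_document` helper are hypothetical placeholders for illustration, not part of pdfquery, and the input is made-up data rather than real parser output:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PdfElement:
    # Hypothetical fields; the real attributes depend on what the parser returns.
    kind: str
    text: str = ""


@dataclass
class PdfPage:
    number: int
    elements: List[PdfElement] = field(default_factory=list)


@dataclass
class PdfDocument:
    pages: List[PdfPage] = field(default_factory=list)


def build_document(raw_pages):
    """Build the hierarchy from a list of pages, each a list of (kind, text) tuples."""
    doc = PdfDocument()
    for number, raw_elements in enumerate(raw_pages, start=1):
        page = PdfPage(number=number)
        for kind, text in raw_elements:
            page.elements.append(PdfElement(kind=kind, text=text))
        doc.pages.append(page)
    return doc


# Made-up input standing in for parsed PDF content.
doc = build_document([
    [("text", "Title"), ("text", "Intro")],
    [("figure", "")],
])
```

With this layout, `doc.pages[0].elements[1].text` reaches the second element of the first page, which is the kind of access I am after.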

This would have to be done iteratively when creating the objects. Is this the best way to structure the data, or would I be better off doing something else? I am also interested in any thoughts on how "expensive" this might be with a few hundred pages, each containing maybe 30-50 elements.
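For a rough sense of scale, something like the following could be used to measure the memory taken by a synthetic hierarchy of that size. This is only a sketch under assumptions: the `Element` class and `build` function are hypothetical, the data is generated rather than parsed from a PDF, and real elements would likely carry more attributes than the two shown here.

```python
import tracemalloc


class Element:
    # __slots__ avoids a per-instance __dict__, which keeps small objects compact.
    __slots__ = ("kind", "text")

    def __init__(self, kind, text):
        self.kind = kind
        self.text = text


def build(n_pages, n_elements):
    """Return a list of pages, each a list of Element objects (synthetic data)."""
    return [
        [Element("text", f"page {p} element {e}") for e in range(n_elements)]
        for p in range(n_pages)
    ]


# Trace allocations made while building 300 pages of 50 elements each.
tracemalloc.start()
pages = build(n_pages=300, n_elements=50)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

total_elements = sum(len(page) for page in pages)
print(f"{total_elements} elements, about {current_bytes / 1024:.0f} KiB traced")
```

At 300 pages of 50 elements that is 15,000 small objects in total; the printed memory figure will vary with the Python version and with how much data each element actually holds.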

martineau
  • Generally you would be looking for a tree structure here, have a read of the answers here initially: https://stackoverflow.com/questions/2358045/how-can-i-implement-a-tree-in-python – Ralph King Mar 22 '21 at 19:04
  • A common way to implement a tree data structure in Python is by using dictionaries which can be nested if desired — see [What is the best way to implement nested dictionaries?](https://stackoverflow.com/questions/635483/what-is-the-best-way-to-implement-nested-dictionaries) – martineau Mar 22 '21 at 19:24
  • There's no such thing as an object-oriented data structure — object-orientation is a programming paradigm. (Which is a way to classify programming languages based on their features.) – martineau Mar 22 '21 at 19:25
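
The nested-dictionary approach suggested in the comments could be sketched roughly as follows. The keys (`pages`, `kind`, `text`) and the sample tuples are illustrative assumptions, not actual pdfquery output:

```python
from collections import defaultdict

# Map page number -> list of element dicts; defaultdict creates pages on first use.
document = {"pages": defaultdict(list)}

# Made-up (page_number, kind, text) records standing in for parsed content.
raw = [(1, "text", "Title"), (1, "text", "Intro"), (2, "figure", "")]
for page_number, kind, text in raw:
    document["pages"][page_number].append({"kind": kind, "text": text})

# Walk the tree: document -> pages -> elements.
for page_number, elements in sorted(document["pages"].items()):
    for index, element in enumerate(elements, start=1):
        print(page_number, index, element["kind"], element["text"])
```

Compared with dedicated classes, this keeps the structure easy to serialize, at the cost of losing attribute access and any per-element methods.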

0 Answers