What object-oriented data structure should I use for pdfquery result data?

Question

I am currently building a program in Python to scrape and parse PDFs, hopefully a bit more elegantly than what is currently available.

The hierarchy of the data that pdfquery outputs in Python is as follows (hopefully this makes sense):

PDFDocument
    PDFPage[1]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    PDFPage[2]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    ...
    PDFPage[i]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]

I would like to create an object-oriented Python data structure that mirrors the hierarchy above, with the pdfElement objects embedded as attributes of pdfPage objects, which are in turn embedded as attributes of the pdfDocument object.
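A minimal sketch of the kind of structure described here, using dataclasses. The class names, the `text` and `bbox` fields, and the `iter_elements` helper are all hypothetical placeholders; the real attributes would depend on what pdfquery actually returns.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class PdfElement:
    # Hypothetical fields; replace with whatever each parsed element carries.
    text: str
    bbox: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)


@dataclass
class PdfPage:
    number: int
    # default_factory gives every page its own list instead of a shared one.
    elements: List[PdfElement] = field(default_factory=list)


@dataclass
class PdfDocument:
    pages: List[PdfPage] = field(default_factory=list)

    def iter_elements(self) -> Iterator[Tuple[int, PdfElement]]:
        # Walk the tree in page order, yielding (page number, element).
        for page in self.pages:
            for element in page.elements:
                yield page.number, element


# Build the hierarchy iteratively, with placeholder data standing in
# for real parser output.
doc = PdfDocument()
for page_number in range(1, 3):
    page = PdfPage(number=page_number)
    for i in range(1, 4):
        page.elements.append(
            PdfElement(text=f"element {i} on page {page_number}")
        )
    doc.pages.append(page)
```

Holding the children in lists keeps the page and element order intact and makes the whole document easy to traverse with a single generator.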

This would have to be done iteratively when creating the class instances. Is this the best way to structure the data, or would I be better off doing something else? I would also like to know how "expensive" this might be if I have a few hundred pages, each with maybe 30-50 elements.
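One rough way to gauge the cost rather than guess is to build a structure of the expected size with placeholder data and measure it with the standard-library `tracemalloc` module. This is only a sketch: the `build` function, the nested dict layout, and the 300 x 50 sizes are assumptions for illustration, and real elements will likely hold more data than a short string.

```python
import tracemalloc


def build(n_pages=300, n_elements=50):
    # Placeholder tree: a list of pages, each holding a list of elements.
    return [
        {
            "number": page_number,
            "elements": [{"text": f"element {e}"} for e in range(n_elements)],
        }
        for page_number in range(1, n_pages + 1)
    ]


tracemalloc.start()
document = build()
# get_traced_memory() returns (current, peak) sizes in bytes.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

n_elements_total = sum(len(page["elements"]) for page in document)
print(f"{n_elements_total} elements, about {current / 1e6:.1f} MB allocated")
```

Swapping the dicts for real element objects in `build` gives a figure closer to the actual program.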

Generally you would be looking for a tree structure here, have a read of the answers here initially: https://stackoverflow.com/questions/2358045/how-can-i-implement-a-tree-in-python — Ralph King, Mar 22 '21 at 19:04
A common way to implement a tree data structure in Python is by using dictionaries which can be nested if desired — see [What is the best way to implement nested dictionaries?](https://stackoverflow.com/questions/635483/what-is-the-best-way-to-implement-nested-dictionaries) — martineau, Mar 22 '21 at 19:24
There's no such thing as an object-oriented data structure — object-orientation is a programming paradigm. (Which is a way to classify programming languages based on their features.) — martineau, Mar 22 '21 at 19:25

0 Answers