I am currently building a program in Python to scrape and parse PDFs, hopefully a bit more elegantly than what is currently available.
The data structure hierarchy output by pdfquery in Python is as follows (hopefully this makes sense):
PDFDocument
    PDFPage[1]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    PDFPage[2]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
    ...
    PDFPage[i]
        PDFElement[1]
        PDFElement[2]
        ...
        PDFElement[i]
I would like to create an object-oriented Python data structure that mirrors the hierarchy above, with the pdfElement objects embedded as attributes of pdfPage objects, which are in turn embedded as attributes of the pdfDocument object.
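To make the idea concrete, here is a minimal sketch of the kind of structure I have in mind. The class fields (`tag`, `text`, `number`), the `build_document` helper and the input format (a list of pages, each a list of `(tag, text)` tuples) are all placeholders I made up for illustration; they are not what pdfquery actually returns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PDFElement:
    # Hypothetical fields; the real ones depend on the parser output.
    tag: str
    text: str = ""


@dataclass
class PDFPage:
    number: int
    elements: List[PDFElement] = field(default_factory=list)


@dataclass
class PDFDocument:
    pages: List[PDFPage] = field(default_factory=list)

    def iter_elements(self):
        """Yield (page number, element) pairs in document order."""
        for page in self.pages:
            for element in page.elements:
                yield page.number, element


def build_document(raw_pages):
    """Build the hierarchy from a list of pages of (tag, text) tuples."""
    document = PDFDocument()
    for number, raw_elements in enumerate(raw_pages, start=1):
        page = PDFPage(number=number)
        for tag, text in raw_elements:
            page.elements.append(PDFElement(tag=tag, text=text))
        document.pages.append(page)
    return document


# Stand-in data, not real pdfquery output.
raw_pages = [
    [("text_line", "Title"), ("text_line", "Intro")],
    [("figure", "Chart")],
]
doc = build_document(raw_pages)
```

With this layout, `doc.pages[0].elements[1]` would be the second element on the first page, and `doc.iter_elements()` walks everything in order.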
The objects would have to be built iteratively, page by page and element by element. Is this the best way to structure the data, or would I be better off doing something else? I am also interested in any thoughts on how "expensive" this might be if I have a few hundred pages, each with maybe 30-50 elements.
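On the cost question, one rough way I could check this myself is to build a dummy hierarchy of the size I expect and measure it with the standard-library `tracemalloc` module. The sketch below assumes 300 pages of 50 elements each and uses throwaway `Element` and `Page` classes with made-up fields, so the numbers would only be indicative; real elements carrying more data would cost more.

```python
import tracemalloc


class Element:
    # __slots__ avoids a per-instance __dict__, which saves some memory.
    __slots__ = ("tag", "text")

    def __init__(self, tag, text):
        self.tag = tag
        self.text = text


class Page:
    __slots__ = ("number", "elements")

    def __init__(self, number):
        self.number = number
        self.elements = []


def build(n_pages, n_elements):
    """Build n_pages dummy pages, each holding n_elements dummy elements."""
    pages = []
    for p in range(1, n_pages + 1):
        page = Page(p)
        for e in range(n_elements):
            page.elements.append(Element("text_line", f"page {p} element {e}"))
        pages.append(page)
    return pages


tracemalloc.start()
pages = build(300, 50)  # assumed size: a few hundred pages, 50 elements each
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

total_elements = sum(len(page.elements) for page in pages)
print(f"{total_elements} elements, about {current_bytes / 1e6:.2f} MB traced")
```

My guess is that 15,000 small objects is not much for Python to hold in memory, but measuring seems safer than guessing.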