What is the best approach to converting a PDF with unstructured data into a JSON object? What set of tools would you use to parse this data?

Of course, I can convert the PDF to CSV format, and I am willing to do so if that would be simpler. I only need to process 3 different documents, and I don't mind doing some of the work manually. I prefer to use Java, but I also know some JavaScript and a tiny bit of Python.

My ultimate goal is to use the JSON objects to populate a MongoDB database and perhaps an Elasticsearch index.

Any advice would be much appreciated.

The document that I want to analyze is:

http://akccompanioneventresults.com/?moid=351

I would like the resulting JSON object to look like this:

{
"dogName" : "My Dolly Two Spots",
"armbandNumber" : 110,
"handler" : "Nancy Muller",
"breed" : "Papillon",
"round" : "Top 20",
"event" : "2019 AKC National Obedience Championship",
"date" : "2019-03-17",
"rings" :
    [
        {
            "ringNumber" : "1",
            "exercises" :
            [
                {
                    "exerciseName" : "DJ",
                    "judge" : "J Stephens",
                    "score" : 0.5
                },
                {
                    "exerciseName" : "DJ",
                    "judge" : "L Hause",
                    "score" : 1.0
                },
                {
                    "exerciseName" : "DR#3",
                    "judge" : "J Stephens",
                    "score" : 0.5
                },
                {
                    "exerciseName" : "DR#3",
                    "judge" : "L Hause",
                    "score" : 1.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "J Stephens",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "L Hause",
                    "score" : 1.0
                }
            ]
        },
        {
            "ringNumber" : "2",
            "exercises" :
            [
                {
                    "exerciseName" : "CD",
                    "judge" : "R Withers",
                    "score" : 1.0
                },
                {
                    "exerciseName" : "CD",
                    "judge" : "V Kinion",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "ROF",
                    "judge" : "R Withers",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "ROF",
                    "judge" : "V Kinion",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "R Withers",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "V Kinion",
                    "score" : 0.0
                }
            ]
        },
        {
            "ringNumber" : "7",
            "exercises" :
            [
                {
                    "exerciseName" : "RHJ",
                    "judge" : "C Wray",
                    "score" : 0.5
                },
                {
                    "exerciseName" : "RHJ",
                    "judge" : "J Nocilly",
                    "score" : 0.5
                },
                {
                    "exerciseName" : "HF-8",
                    "judge" : "C Wray",
                    "score" : 2.0
                },
                {
                    "exerciseName" : "HF-8",
                    "judge" : "J Nocilly",
                    "score" : 1.5
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "C Wray",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "J Nocilly",
                    "score" : 0.0
                }
            ]
        },
        {
            "ringNumber" : "8",
            "exercises" :
            [
                {
                    "exerciseName" : "DR",
                    "judge" : "B Lee",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "DR",
                    "judge" : "J Caputa",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "SE",
                    "judge" : "B Lee",
                    "score" : 2.0
                },
                {
                    "exerciseName" : "SE",
                    "judge" : "J Caputa",
                    "score" : 2.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "B Lee",
                    "score" : 0.0
                },
                {
                    "exerciseName" : "Misc",
                    "judge" : "J Caputa",
                    "score" : 0.0
                }
            ]
        }
    ]

}...

Neil Lunn
  • You can read the `PDF` file using `PDFBox` and convert it to text format: [Parsing PDF files (especially with tables) with PDFBox](https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox). Parse this raw text and convert it to a `POJO` model. Once you have built the model objects, you can easily serialise them to `JSON` or `CSV` with a library like [`Jackson`](https://github.com/FasterXML/jackson-databind). – Michał Ziober Mar 20 '19 at 22:26
  • You will need a PDF library/API that allows extraction of text with position data from a PDF (OCR is not necessary). You might try one like tabula, which is optimized for table recognition, but the structure of the data in your PDF is so weird that I doubt it will help much. Some of the string data can be classified by positional analysis alone, but some you'll have to analyze further, probably using regular expressions; the lines with the dog names in particular require this analysis. This will give you the text pieces with meaning, which you can process further as indicated by @Michał. – mkl Mar 22 '19 at 10:36
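
To make the "parse the raw text with regular expressions, then build model objects" step from the comments concrete, here is a minimal Java sketch (Java 16+ for the record). It covers only that middle step. Extracting the text (for example with PDFBox's `PDFTextStripper`) and serialising the result (for example with Jackson's `ObjectMapper`) are left out because they need external libraries. The line layout matched here, `<armband> <dog name> | <breed> | <handler>`, is purely an assumption for illustration and is not the real layout of the results PDF. `ResultLineParser` and `Entry` are hypothetical names, and the pattern would have to be adapted to whatever the extracted text actually looks like.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the line layout matched below is an assumption, not the real PDF layout.
final class ResultLineParser {

    // Plain data holder for one dog entry; field names follow the JSON in the question.
    record Entry(int armbandNumber, String dogName, String breed, String handler) {}

    // Hypothetical layout: "<armband> <dog name> | <breed> | <handler>"
    private static final Pattern ENTRY_LINE = Pattern.compile(
            "^(\\d{1,6})\\s+(.+?)\\s*\\|\\s*(.+?)\\s*\\|\\s*(.+?)\\s*$");

    // Returns the parsed entry, or an empty Optional when the line is not an entry line
    // (for example a ring header or a row of scores).
    static Optional<Entry> parse(String line) {
        Matcher m = ENTRY_LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Entry(
                Integer.parseInt(m.group(1)), // armband number
                m.group(2),                   // dog name
                m.group(3),                   // breed
                m.group(4)));                 // handler
    }
}
```

Calling `ResultLineParser.parse(line)` on each extracted line and keeping only the non-empty results would give a list of `Entry` objects. Lines that do not match, such as score rows, would need their own patterns or the positional analysis mentioned in the second comment.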

0 Answers