
I'm trying to take HTML and generate JSON that keeps the same structure.

I'm trying to use pandoc, as I've had some success converting from format A to format B with pandoc before.

I'm trying to convert this file:

example.html

<p>Hello guys! What's up?</p>

Using the command:

pandoc -f html -t json example.html

What I expect is something like:

[{ "p": "Hello guys! What's up?"}]

What I get is:

[
  { "Para":
    [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Str", "c": "guys!"},
      {"t": "Space"},
      {"t": "Str", "c": "What's"},
      {"t": "Space"},
      {"t": "Str", "c": "up?"}
    ]
  }
]

The problem seems to be that when pandoc reads the text content, it splits it on every space character and makes an array out of the words, while I expected pandoc to treat the whole string as a single element.

I'm a beginner at pandoc and I haven't been able to find out how to tweak that behavior.

Do you have an idea of how I can get the desired output? Do you know another tool that can do this? Neither the tool nor the language it's written in matters.

Thanks.

Edit: You can test this behavior with pandoc's online tool.

Edit 2: Workaround. I couldn't find a way to do the HTML->JSON conversion with pandoc. As a workaround, I followed the suggestion from the comments and implemented a solution using Himalaya, a Node package. The result is exactly what I wished for, even though it doesn't use pandoc.

Loïc N.
  • 1
    Not to be smart, but why are you expecting such a result? I don't see that output format described in the documentation. – Pogrindis Sep 21 '18 at 08:58
  • 1
    I know of one project that works closer to what you're expecting, though: https://github.com/andrejewski/himalaya Demo: https://jew.ski/himalaya/ – Pogrindis Sep 21 '18 at 09:09
  • @Pogrindis Hi, I'm not trying to say that this behavior is inconsistent with the documentation. It's mostly that I expected it to behave differently. That's why I'm asking for help. Maybe there is an option to modify this behavior, but I don't know pandoc, so I was hoping someone here would know. Thanks for the link to Himalaya, I'm going to check it out. – Loïc N. Sep 21 '18 at 09:15
  • @LoïcN. The JSON in the question is similar to the output produced by pandoc, but not identical. I assume that's because some manual transcription was involved. I recommend using [`jq`](https://stedolan.github.io/jq/) to get human-readable JSON like so: `echo "<p>Hello guys! What's up?</p>" | pandoc -f html -t json | jq`.
    – tarleb Sep 21 '18 at 11:36
  • @tarleb Hi tarleb. Thanks for the suggestion. I tested your solution, but from what I can see, it reformats the result in a way that is easier to read while the structure itself is unchanged. – Loïc N. Sep 21 '18 at 11:47
  • Right. My comment solely exists to point out how one could have inserted the actual JSON output in the question. See my edit. – tarleb Sep 21 '18 at 12:00
  • @tarleb Yup, sorry about that. I misread your comment. Yes, you were right: I first used a hand-written transcription, which is why the output wasn't exactly the same as the actual output. Thanks for editing my original post with the correct value. – Loïc N. Sep 21 '18 at 12:03
  • @PoGrindis I went with Himalaya, thanks. – Loïc N. Sep 24 '18 at 12:50

2 Answers


Currently, pandoc's JSON representation is not very human-readable, because it is auto-generated from pandoc's Haskell data types (the document AST). There is some discussion about changing that eventually.

I guess you're looking for something like https://codebeautify.org/xmltojson? There also seem to be plenty of commandline-tools that do that.
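To make the point about the AST-derived JSON concrete, here is a minimal Python sketch that joins pandoc's `Str` and `Space` nodes back into one string per paragraph. It assumes the object layout with a top-level `"blocks"` key used by recent pandoc releases; the sample document is hand-written for illustration (its `pandoc-api-version` value is a placeholder), and the helper names `inline_text` and `flatten_blocks` are hypothetical, not part of pandoc.

```python
import json

# Hand-written sample shaped like pandoc's JSON for <p>Hello guys! What's up?</p>.
# The "pandoc-api-version" value is a placeholder; it varies between releases.
SAMPLE = json.loads("""
{"pandoc-api-version": [1, 23], "meta": {}, "blocks": [
  {"t": "Para", "c": [
    {"t": "Str", "c": "Hello"}, {"t": "Space"},
    {"t": "Str", "c": "guys!"}, {"t": "Space"},
    {"t": "Str", "c": "What's"}, {"t": "Space"},
    {"t": "Str", "c": "up?"}]}]}
""")


def inline_text(inlines):
    """Join Str/Space/SoftBreak inline nodes into one string; ignore other types."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


def flatten_blocks(doc):
    """Map each Para/Plain block to {block_type: text}; skip other block types."""
    return [
        {block["t"]: inline_text(block["c"])}
        for block in doc["blocks"]
        if block["t"] in ("Para", "Plain")
    ]


print(json.dumps(flatten_blocks(SAMPLE)))
# prints: [{"Para": "Hello guys! What's up?"}]
```

This only handles flat paragraphs; nested inlines such as emphasis or links are silently dropped, so it is an illustration of the structure rather than a general converter.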

mb21
  • Hi. I'm looking for exactly something like codebeautify, but they don't say which tools they use behind their web interface. I've tried the node module **xml-js**, but it produces errors when reading my HTML, while it works very well on actual XML files that are more complex than what I tested. This is why I'm trying to get html->json rather than xml->json, unless I can find an XML tool that is tolerant of HTML. As long as it works, that's all that matters. I'm testing **himalaya** at the moment, as recommended by Pogrindis. – Loïc N. Sep 21 '18 at 12:22

Pandoc is a tool for converting documents. Its JSON output is just another representation of the AST (abstract syntax tree) that pandoc uses internally.

Original Document --> Pandoc's AST --> Output Document
                   |                |
                pandoc           pandoc

Asking pandoc to output JSON is asking for that AST in its JSON format.

If I understand correctly, you need something more like an XML-to-JSON converter, such as this Python xmljson module or an online tool like this one.

There are plenty of tools for the job as you picture it; just search for "XML to JSON converter".
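As one possible illustration of such a structure-preserving conversion, the sketch below uses only Python's standard `html.parser` module. The class `StructureBuilder` and the functions `simplify` and `html_to_json` are hypothetical names made up for this example. It drops attributes, treats only a small assumed set of void elements specially, and does no recovery for badly nested markup.

```python
import json
from html.parser import HTMLParser

# Elements with no closing tag; a small assumed subset, not the full HTML list.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class StructureBuilder(HTMLParser):
    """Build nested lists of {tag: children} dicts from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.root = []            # top-level nodes
        self.stack = [self.root]  # child lists of the currently open elements

    def handle_starttag(self, tag, attrs):
        children = []
        self.stack[-1].append({tag: children})
        if tag not in VOID:
            self.stack.append(children)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.stack[-1].append(data)


def simplify(nodes):
    """Collapse an element whose only child is a text node into {tag: text}."""
    result = []
    for node in nodes:
        if isinstance(node, str):
            result.append(node)
            continue
        tag, children = next(iter(node.items()))
        children = simplify(children)
        if len(children) == 1 and isinstance(children[0], str):
            result.append({tag: children[0]})
        else:
            result.append({tag: children})
    return result


def html_to_json(html):
    """Parse an HTML fragment and return its simplified tag structure."""
    parser = StructureBuilder()
    parser.feed(html)
    parser.close()
    return simplify(parser.root)


print(json.dumps(html_to_json("<p>Hello guys! What's up?</p>")))
# prints: [{"p": "Hello guys! What's up?"}]
```

For real-world HTML a dedicated parser is likely the safer choice; this sketch is only meant to show that the expected output shape is a direct walk over the tag tree.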

The JSON representation of the AST is normally used to pipe pandoc's output into another program that can handle JSON, so that you can alter the AST and write filters that manipulate the structure of your document.
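The filter idea described above could look roughly like the following Python sketch, which uses only the standard library. It assumes the JSON layout with a top-level `"blocks"` key used by recent pandoc releases; `walk`, `shout`, and `run_filter` are hypothetical names for this example, and upper-casing every `Str` node is just a stand-in transformation.

```python
import io
import json


def walk(node, fn):
    """Recursively rebuild the AST, applying fn to every dict that has a "t" key."""
    if isinstance(node, list):
        return [walk(item, fn) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, fn) for key, value in node.items()}
        return fn(node) if "t" in node else node
    return node


def shout(node):
    """Stand-in transformation: upper-case the text of every Str node."""
    if node["t"] == "Str":
        return {"t": "Str", "c": node["c"].upper()}
    return node


def run_filter(stream_in, stream_out, fn=shout):
    """Read a pandoc JSON document, transform its blocks, and write it back out."""
    doc = json.load(stream_in)
    doc["blocks"] = walk(doc["blocks"], fn)
    json.dump(doc, stream_out)


# In-memory demonstration with a hand-written document (api version is a placeholder).
source = io.StringIO(json.dumps({
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "Hello"}, {"t": "Space"}, {"t": "Str", "c": "guys!"}]}],
}))
target = io.StringIO()
run_filter(source, target)
print(target.getvalue())
```

Saved as a script (say, a hypothetical `filter.py`) whose entry point calls `run_filter(sys.stdin, sys.stdout)`, it could sit between two pandoc invocations, for example `pandoc -f html -t json example.html | python filter.py | pandoc -f json -t html`.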

ekiim