I'm working on a chrome extension that sends the source code of a page to a server where it should be parsed.

Capturing the source code is working fine, if I display it in the console, it looks like this:

[image: console output showing the captured page source]

Then in order to push it to my PHP server, I first isolate the content of the body (what you've seen in the previous picture is stored in "result"):

html_content = result.querySelectorAll('body')[0].outerHTML;
html_content = JSON.stringify(html_content);

If I then display html_content in my console, I get something like this:

[image: console output showing html_content after JSON.stringify]

So now that I have a JSON string, I try to send it like this:

var xhr = new XMLHttpRequest(); 
xhr.open("POST", "myAPI_URL");
xhr.setRequestHeader("Content-Type", "application/json");
xhr.send(html_content);

The call to the URL works, but I don't get anything in $_POST. It's empty.

If I try to assign a specific variable like this:

xhr.send('content='+html_content);

It doesn't work either. On the PHP side, I'm just doing this:

print_r($_POST);

And this returns an empty array.

======= UPDATE =========

Based on the feedback below, I adapted a few things and it's getting better. As suggested, I'm now using text/plain and I keep the DOM object intact (I no longer take only the body).

var xhr = new XMLHttpRequest();
xhr.open("POST", "myAPI URL");
xhr.setRequestHeader("Content-Type", "text/plain");
xhr.send(content);

If I use this on the server side:

$html_content = file_get_contents('php://input');

This variable contains the text string as expected, so that's great. But now, if I try to parse the received HTML, it goes wrong.

$html_content = file_get_contents('php://input');
$dom = new DOMDocument;
$dom->loadHTML($html_content);

When doing this, I get warnings like:

<b>Warning</b>:  DOMDocument::loadHTML(): ID ghostery-no-tracker already defined in Entity, line: 506 in <b> my url </b> on line <b>25</b><br />

It's as if it doesn't understand the HTML correctly.

Any idea?

Laurent
  • You shouldn't need the querySelectorAll here; `document.body` is a universal accessor for the body element. Furthermore, I'd advise against JSON-i-fying this. Just POST it as `text/plain` to your server, using the modern `Fetch` API rather than the old XMLHttpRequest object: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#Supplying_request_options - if you do want to keep this code, then make sure to check for errors during transport by assigning `onerror` and `onreadystatechange` handlers. – Mike 'Pomax' Kamermans Aug 07 '18 at 21:12
  • Or, to add to @Mike'Pomax'Kamermans, at least use `result.querySelector('body')` instead of `.querySelectorAll()`. – Scott Marcus Aug 07 '18 at 21:14
  • One can get the posted data with `php://input` ... `$_POST` fields do require a form. – Martin Zeitler Aug 07 '18 at 21:14
  • @MartinZeitler this is rather untrue. As long as the HTTP verb used was POST, the $_POST variable will get populated. – Mike 'Pomax' Kamermans Aug 07 '18 at 21:16
  • @Mike'Pomax'Kamermans HTTP method `POST` is closely related to the content types `multipart/form-data` and `application/x-www-form-urlencoded`, depending on the size of the payload. What I meant is to put the HTML payload into the request body, not the request headers, as `text/html`. In this situation, it cannot be accessed through `$_POST` (which will be empty). – Martin Zeitler Aug 07 '18 at 21:22
  • That's a very different thing. The question clearly shows the HTML code being sent as payload, not as a request header. – Mike 'Pomax' Kamermans Aug 07 '18 at 21:34
  • @Mike'Pomax'Kamermans I'll give a try to fetch later but your suggestion for text/plain worked! – Laurent Aug 08 '18 at 06:51
  • @ScottMarcus good point! – Laurent Aug 08 '18 at 06:52
  • @MartinZeitler you are right, I was using this in my previous version of the script, now it's back. – Laurent Aug 08 '18 at 06:52
  • Your HTML contains duplicate IDs, either because they were in your original HTML already, or because the JS on the page created new elements (which you would then get in the outerHTML as well). See if you can fix that on your end beforehand, or try instructing the DOM parser to ignore such errors: https://stackoverflow.com/questions/1148928/disable-warnings-when-loading-non-well-formed-html-by-domdocument-php – CBroe Aug 08 '18 at 06:56
  • @CBroe that makes sense, it's not my HTML which means that I'll never manage to avoid this without something to ignore warnings, thanks! – Laurent Aug 08 '18 at 16:15

1 Answer

Martin commented (almost) correctly:

one can get the posted data with php://input ... $_POST fields do require a form.

PHP does not populate the $_POST superglobal if the Content-Type is not one of the form-data content types (application/x-www-form-urlencoded or multipart/form-data). I think the reason behind this is simply that these are the only formats implemented in PHP to map values to keys.

But the data is still there!

You can read it from php://input. Note that the STDIN constant is not an alternative here: it is only defined when PHP runs under the CLI SAPI, not when it is serving a web request.

http://php.net/manual/en/wrappers.php.php

Do not use $HTTP_RAW_POST_DATA: it was deprecated in PHP 5.6 and removed in PHP 7.0.
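
As a rough illustration of the content-type point above, here is a sketch of the two client-side options. The helper names `buildFormRequest` and `buildRawRequest` are made up for this example, and "myAPI_URL" is just the placeholder from the question. With the form-encoded body, PHP should populate `$_POST['content']`; with the `text/plain` body, `$_POST` stays empty and the payload has to be read from php://input.

```javascript
// Hypothetical helpers: they only build the options object passed to fetch().

// Form-encoded body: PHP parses this content type into $_POST.
function buildFormRequest(html) {
  // URLSearchParams percent-encodes the value, so '&', '=' and '+'
  // inside the HTML do not break the key=value structure.
  const body = new URLSearchParams({ content: html }).toString();
  return {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  };
}

// Raw body: PHP leaves $_POST empty; read it with
// file_get_contents('php://input') on the server.
function buildRawRequest(html) {
  return {
    method: "POST",
    headers: { "Content-Type": "text/plain; charset=utf-8" },
    body: html,
  };
}

// Usage inside the extension (not executed here):
// fetch("myAPI_URL", buildFormRequest(document.documentElement.outerHTML))
//   .then((response) => response.text())
//   .then(console.log)
//   .catch(console.error);
```

Note that `'content=' + html_content` from the question fails for two reasons: the request was still labelled application/json, and the HTML was not percent-encoded.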

Update

Please ask only one question per thread, especially if the two things are not related.

DOMDocument shows the warnings not because it doesn't understand the HTML but because the HTML is buggy ;-)

It's up to you how to handle the warnings: either ignore them or fix the input. Do not expect DOMDocument to be as forgiving as a modern browser.
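
If you want to go the "fix the input" route, one option is to check for repeated `id` values in the extension before sending the page. A minimal sketch, assuming a hypothetical helper `findDuplicateIds` that takes a list of id strings:

```javascript
// Hypothetical helper: returns every id value that occurs more than once,
// in the order the ids were first seen.
function findDuplicateIds(ids) {
  const counts = new Map();
  for (const id of ids) {
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([id]) => id);
}

// Usage in the page context (not executed here):
// const ids = Array.from(document.querySelectorAll("[id]"), (el) => el.id);
// console.log(findDuplicateIds(ids));
```

If the HTML is not yours to fix, the other route is to silence the warnings on the PHP side, for example by calling `libxml_use_internal_errors(true)` before `loadHTML()`, as discussed in the question linked in the comments.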

Daniel W.