I'm using Zend Framework 2 and trying to extract the contents of a <form> tag from a full HTML page.

I'm scraping the page from a different site. The page takes some time to load, and a large full-page loader is shown in the meantime.

I have tried DOMDocument and phpQuery, but without success.

This is with DOMDocument:

$htmlForm = new \DOMDocument();
$htmlForm->loadHTML($formData);
$onlyForm = $htmlForm->getElementById('#Frmswift');
echo $htmlForm->saveHTML($onlyForm);

This is with phpQuery:

$doc = phpQuery::newDocument($formData);
$doc->find('#Frmswift')->parent()->siblings()->remove();
echo pq($doc)->html();

Where am I wrong?

Keyur

2 Answers

If I understood correctly, the site loads the HTML form dynamically, on a DOM event or in some other way. If so, you will not be able to scrape this form in PHP unless you know the URL that is requested when the site loads the form.
Check Chrome's dev tools -> Network and look at the XHR requests that have been made.

DOMDocument::loadHTML() loads the "raw" DOM, not the DOM after JavaScript has manipulated it, so getElementById() cannot find the form: the element does not exist yet.
PHP is not a good option for scraping pages that are rendered by JavaScript. I would suggest doing it in Node.js or with PhantomJS.
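
To illustrate the point, here is a minimal sketch, assuming the fetched page contains only a loader and a script that injects the form later. The sample markup, the `loader` id, the `load-form.js` file name, and the helper `elementExistsInRawHtml` are all hypothetical and not taken from the real site.

```php
<?php
// Hypothetical helper: reports whether an element with the given id
// is present in the HTML exactly as the server delivered it.
function elementExistsInRawHtml(string $html, string $id): bool
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new \DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Note: no "#" prefix, getElementById() takes the bare id.
    return $dom->getElementById($id) !== null;
}

// Assumed shape of the raw response: the form is added later by JavaScript,
// which PHP never executes.
$rawHtml = '<!DOCTYPE html>
<html>
<head><title>External Page Content</title></head>
<body>
    <div id="loader">Loading...</div>
    <script src="load-form.js"></script>
</body>
</html>';

var_dump(elementExistsInRawHtml($rawHtml, 'loader'));
var_dump(elementExistsInRawHtml($rawHtml, 'Frmswift'));
```

If the second check reports that the form is missing for the real page as well, the form is most likely injected after load, and it has to be requested from the URL that delivers it.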

SzymonM
  • Thanks for the answer. But to do this, I would have to install Node just for scraping the web page, which would be inconvenient. – Keyur Mar 02 '17 at 09:49

EDIT

Okay, check this YouTube video. It explains well how to use Chrome's developer tools, specifically the Network tab (Firefox works analogously). Go to the website that holds the <form> from your question, right-click and choose Inspect Element, then:

  1. On the Network tab, filter the list to show only XHR requests.

  2. Go through the list of requests and check the result of each one in the Response sub-tab (in the video it is at the bottom right of the screen). You should find the request that returns the HTML of this form.

  3. Once you have found it, you know where the form comes from. Select this request in the developer tools (still on the Network tab) and, again at the bottom right, go to the Headers sub-tab.

  4. Copy the Request URL. This is where the form HTML comes from.

  5. Check the Request Method.

    5.1. If it is GET, use PHP's $htmlForm = file_get_contents(URL from step 4); and proceed with the ORIGINAL POST below, replacing $sampleHtml with $htmlForm.

    5.2. If it is POST, refer to this link, a Google search, or this Stack Overflow answer, and again proceed with the ORIGINAL POST using the result.
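
Steps 4 and 5 can be sketched roughly as follows. This is only a sketch under assumptions: the function names `buildContextOptions` and `fetchFormHtml`, the 10-second timeout, the URL, and the form-encoded POST body are hypothetical, and the real endpoint may expect other headers or fields. It uses PHP's stream context with file_get_contents() for both methods instead of curl, so no extra extension is needed.

```php
<?php
// Hypothetical helper: builds stream context options for a GET or POST request.
function buildContextOptions(string $method, array $fields = []): array
{
    $method = strtoupper($method);
    $http = ['method' => $method, 'timeout' => 10];

    if ($method === 'POST') {
        // Assumption: the endpoint accepts a classic form-encoded body.
        $http['header'] = 'Content-Type: application/x-www-form-urlencoded';
        $http['content'] = http_build_query($fields);
    }

    return ['http' => $http];
}

// Hypothetical helper: requests the URL found in the Network tab and
// returns the response body, or null when the request fails.
function fetchFormHtml(string $url, string $method = 'GET', array $fields = []): ?string
{
    $context = stream_context_create(buildContextOptions($method, $fields));
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Usage (the URL is a placeholder, replace it with the Request URL from step 4):
// $htmlForm = fetchFormHtml('https://example.com/path/to/form');
// $htmlForm = fetchFormHtml('https://example.com/path/to/form', 'POST', ['key' => 'value']);
```

The returned string can then take the place of $sampleHtml in the ORIGINAL POST.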

ORIGINAL POST

Hello, mate.

I see a mistake in your code snippet - you don't need # when using getElementById

Check the following code snippet and let me know if it helps you (refer to comments for details):

$sampleHtml = ' 
    <!DOCTYPE html>
    <html>
    <head>
        <title>External Page Content</title>
    </head>
    <body>
        <h1>Some header</h1>
        <p>Some lorem text ....</p>
        <form id="Frmswift">
            <input name="input1" type="text">
            <input name="input2" type="text">
            <textarea name="mytextarea"></textarea>
        </form>
    </body>
    </html>';

$dom = new \DOMDocument();
$dom->loadHTML($sampleHtml);

// When you use getElementById, do not put # in front of the id.
// This method works like JavaScript's getElementById().
$form = $dom->getElementById('Frmswift');

// Use a second, blank document to hold
// the previously selected form
$blankDoc = new \DOMDocument();
$blankDoc->appendChild($blankDoc->importNode($form, true));

// htmlspecialchars is used just to show the code;
// otherwise you will see the inputs rendered in the browser. This is just
// for testing purposes. I suppose you will need $blankDoc,
// which holds only the form.
echo htmlspecialchars($blankDoc->saveHTML());
exit;

Output:

<form id="Frmswift"> 
    <input name="input1" type="text">
    <input name="input2" type="text">
    <textarea name="mytextarea"></textarea>
</form>
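
A slightly more defensive variant is sketched below. It assumes that getting null back is acceptable when the form is not in the HTML, and it serializes the node with saveHTML($node) instead of importing it into a second document. The function name `extractFormHtml` is hypothetical.

```php
<?php
// Hypothetical helper: returns the markup of the element with the given id,
// or null when the element is not present in the HTML.
function extractFormHtml(string $html, string $formId): ?string
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new \DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $dom->getElementById($formId);
    if ($form === null) {
        // The form is not in this HTML, e.g. because JavaScript adds it later.
        return null;
    }

    // saveHTML() with a node argument serializes only that node.
    return $dom->saveHTML($form);
}

$sampleHtml = '<!DOCTYPE html>
<html>
<head><title>External Page Content</title></head>
<body>
    <h1>Some header</h1>
    <form id="Frmswift">
        <input name="input1" type="text">
    </form>
</body>
</html>';

$formHtml = extractFormHtml($sampleHtml, 'Frmswift');
echo $formHtml === null ? 'Form not found in the HTML' : htmlspecialchars($formHtml);
```

With the guard in place, a missing form shows up as a null return value rather than as a fatal error inside importNode().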
codtex
  • Thanks, but I'm getting this error: `Fatal error: Uncaught TypeError: Argument 1 passed to DOMDocument::importNode() must be an instance of DOMNode, null given` – Keyur Mar 02 '17 at 09:46
  • Is your form coming as `HTML` at all? This error means that `$form` is null when you pass it to `$blankDoc->importNode($form, true)`, which makes me think that `$form = $dom->getElementById('Frmswift');` returned `null`. Something is wrong. Can you tell me what you have in `$formData`? – codtex Mar 02 '17 at 12:17
  • Yes. As mentioned in my question, the form tag arrives a bit late. What should I do about that? – Keyur Mar 02 '17 at 12:26
  • Then yes, this approach will not work. You need to find out exactly which URL this form comes from, using the browser's developer tools (inspect element and look at all the HTTP requests, as mentioned in the other answer). Once you have figured out where the form comes from, you can use PHP's `curl` functions if you need to do a `POST`, or `file_get_contents()` for a `GET`. I will try to find a video that shows how to do it. – codtex Mar 02 '17 at 20:58
  • Okay. Thanks for your efforts. – Keyur Mar 03 '17 at 09:44
  • @KeyurK did you manage to do it? – codtex Mar 04 '17 at 07:45
  • Not yet. I'm on holiday. But it seems it will work well. I will implement it and let you know. – Keyur Mar 04 '17 at 09:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/137205/discussion-between-keyur-k-and-sand). – Keyur Mar 04 '17 at 09:25