3

I am relatively new to Node.js and I am looking at hacking some of my company's products together. However, one of the systems is written for Node.js, and the other system, which I don't have access to, is controlled by a standard login page. The page behind this login holds a few key data points that I would like to scrape out of the HTML. I would like to do this behind the scenes and don't want to show the webpage or anything. I just want to execute the form submission and read the response.

Can anyone point me in the right direction?

Eric V
  • 249
  • 5
  • 16

1 Answer

7

There are different levels of automation you might need, depending on how complex your sign-in flow is and how the underlying system is built.

Do it via an API

First, don't rely on screen scraping for anything. It's BAD and prone to failure. When the underlying application is updated, no one thinks about screen scrapers and things change. If there's a REST API or some other type of RPC (Remote Procedure Call) to be used, use that instead. If there's not, ask for an API. Only after that should you try screen scraping.

Low level HTTP requests

You might be able to emulate the HTTP requests without trying to emulate a browser completely. Complete the requests in a browser first while the Network Monitor in your Developer Tools is open. Find the minimal number of requests you need. Sometimes this is just a POST to /login with username and password fields. Sometimes you will need to store a cookie and then request the required page with your user session.

Use needle, or the more common but more heavyweight request.

Headless Browsers

Headless browsers are the first step into UI automation and let you stop worrying about what the backend HTTP requests do. You tell the API to fill in the login and password fields and submit the form. A headless browser does the background work for you, like cookies and redirects, and returns a rendered web page.

Use Zombie.js, PhantomJS, or CasperJS.
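As a sketch with Zombie.js (run `npm install zombie` first): the URL, selectors, button label, and markers below are all assumptions you would replace with the real page's values. The small string-scan helper stands in for whatever extraction you need; with a headless browser you would normally prefer something like `browser.text(selector)` instead.

```javascript
// Scan the returned HTML for one data point between two markers.
// A plain string search keeps the sketch dependency-free.
function extractBetween(html, startMarker, endMarker) {
  const start = html.indexOf(startMarker);
  if (start === -1) return null;
  const from = start + startMarker.length;
  const end = html.indexOf(endMarker, from);
  return end === -1 ? null : html.slice(from, end).trim();
}

// Fill in the login form headlessly and scrape the signed-in page.
// Selectors, button label, and markers are hypothetical.
function loginAndScrape(url, user, pass, done) {
  const Browser = require('zombie'); // loaded lazily: npm install zombie
  const browser = new Browser();
  browser.visit(url, function () {
    browser.fill('input[name=username]', user);
    browser.fill('input[name=password]', pass);
    browser.pressButton('Log in', function () {
      // Zombie has followed the redirects and stored the cookies for us.
      done(extractBetween(browser.html(), '<span id="token">', '</span>'));
    });
  });
}
```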

Full Browser Automation

More complex web site automation sometimes requires a full browser to work correctly. This is usually when you are relying on heavily JavaScript-rendered web pages and more advanced user interaction.

Webdriver is a standard API for controlling a browser. A Webdriver client is a language-specific implementation of that API that talks to a Webdriver server. The Webdriver server launches a full browser instance and converts the API calls into actual browser actions.

Webdriver.io and Selenium Standalone Server will cover most of what you need.
Internet Explorer has a native server available.
Chrome releases its own native Webdriver server too.

Matt
  • 51,189
  • 6
  • 117
  • 122
  • Well, I don't have an API to call at the moment; it's still being built. This is an older system that basically has a web authentication gate that does several redirects and then sets the access tokens in the HTTP request/response headers. I have looked at all of the traffic going back and forth for this system. I will take a look at the headless browsers approach; it looks promising. Thanks! – Eric V Aug 25 '16 at 21:20