I'm trying to scrape the $('a[href^="mailto:"]') of this website: https://celsius.network/

When I go to the browser console and run that, I get a link so I know it's there.

The issue is that my request (using the Axios library) returns the HTML as the server sends it, before any JavaScript has run. I've set the User-Agent header, but that doesn't seem to make a difference.

const axiosClient = () =>
  axios.create({
    headers: {
      "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"
    },
    timeout: 10000
  });


axiosClient()
  .get("https://celsius.network")
  .then(({ data }) => {
    console.log("DATAAAAAAAA: ", data);
  })

This is returning the original HTML, with the body:

<body>
  <div id="app"> </div>
  ....

instead of the one that's fully loaded after all the javascript has manipulated the DOM.
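As a quick offline check of that difference, here is a minimal sketch that looks for mailto: links in an HTML string. The helper name `extractMailtoAddresses` and both sample strings are made up for illustration (the "rendered" markup is an assumption, not copied from the site), and the regex is a rough stand-in for a real HTML parser.

```javascript
// Hypothetical helper (not part of the original post): collects the addresses
// of mailto: links found in an HTML string using a simple regex.
function extractMailtoAddresses(html) {
  const pattern = /href\s*=\s*["']mailto:([^"'?]+)/gi;
  const addresses = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    addresses.push(match[1]);
  }
  return addresses;
}

// Roughly what the plain HTTP request returns: an empty mount point.
const staticHtml = '<body><div id="app"> </div></body>';

// Assumed shape of the DOM once the page's scripts have run.
const renderedHtml =
  '<body><div id="app"><a href="mailto:presale@celsius.network">Contact</a></div></body>';

console.log(extractMailtoAddresses(staticHtml));   // []
console.log(extractMailtoAddresses(renderedHtml)); // [ 'presale@celsius.network' ]
```

Run against the `data` that Axios returns, a check like this would come back empty, since the link only exists after the scripts have filled in `#app`.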

P.S. I am doing this through firebase functions, so I think there are limits to what I can install.

UPDATE

const findEmail = url =>
  new Promise((resolve, reject) => {
     // here!
  });
bigpotato

1 Answer

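A plain HTTP request isn't enough to reproduce what you see when visiting the page in your browser: Axios only downloads the HTML the server sends and never executes the page's JavaScript, so anything rendered into #app afterwards is missing. There are several options out there, but Puppeteer, which drives a headless Chrome, may be a good candidate for the job.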

Most things that you can do manually in the browser can be done using Puppeteer!

Check out the following...

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://celsius.network/');
  // Runs in the page context, against the DOM as it stands after the page has loaded
  const textContent = await page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent);

  console.log(textContent); // presale@celsius.network

  await browser.close();
})();

I'm not totally clear on your constraints...

there are limits to what I can install

If you have axios, I'd assume you can install this npm package?


Per your update, Puppeteer can also be used with plain promise chains instead of async/await. Something like the following should do it for you...

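const findEmail = url =>
  new Promise((resolve, reject) => {
    puppeteer.launch().then((browser) => {
      browser.newPage()
        // Navigate to the URL passed in, then hand the page to the next step
        .then((page) => page.goto(url).then(() => page))
        .then((page) => page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent))
        .then((email) => browser.close().then(() => resolve(email)))
        // On any failure, still try to close the browser, then reject
        .catch((err) => browser.close().then(() => reject(err), () => reject(err)));
    }, reject); // the launch itself failed
  });

findEmail('https://celsius.network/').then((email) => {
  console.log(email); // presale@celsius.network
});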
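Since an async function already returns a Promise, the same thing can probably be written more compactly. The sketch below is one way to do it, under a few assumptions: the name `findEmailAsync` and the optional `launch` parameter are introduced here purely for illustration (the parameter makes it possible to swap in a stub browser), and the function resolves to null when the page has no mailto: link instead of throwing.

```javascript
// Sketch only: `launch` is a hypothetical injectable parameter; by default it
// falls back to puppeteer.launch(), which requires the puppeteer package.
const findEmailAsync = async (url, launch = () => require('puppeteer').launch()) => {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Runs in the page context; yields null when no mailto: link exists.
    return await page.evaluate(() => {
      const link = document.querySelector('a[href^="mailto:"]');
      return link ? link.textContent : null;
    });
  } finally {
    // Close the browser whether or not the lookup succeeded.
    await browser.close();
  }
};
```

It is called the same way as the promise version, e.g. `findEmailAsync('https://celsius.network/').then(console.log)`, and the `finally` block keeps a failed navigation from leaving a browser process running.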
scniro
  • wow I'd like to try this out! is there a way to put this inside a Promise instead of the `(async() => {})()` thing you have going – bigpotato Nov 09 '17 at 04:21
  • I updated the question to describe what I'm talking about with the Promise – bigpotato Nov 09 '17 at 04:22
  • @Edmund I have updated my answer for you, please take a look and let me know if this is what you were wanting? – scniro Nov 09 '17 at 13:12