JavaScript asynchronous web crawler

Question

I have an async function that reads a list of websites from a CSV file.

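const fs = require('fs');
const readline = require('readline');

async function readCSV() {
  const fileStream = fs.createReadStream('./topm.csv');

  // crlfDelay: Infinity makes readline treat \r\n as a single line break
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    const currentline = line.split(",");

    // The second CSV column holds the domain; each check finishes before the next line is read
    const res_server_http = await check_page("http://www." + currentline[1]);
  }
}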

Every time I read a site, I call the check_page function, which performs some operations. I wait for each call to finish before moving on to the next site.

async function check_page(web_page){
     // do some operations....

}

Up to this point it works correctly, but now I have to integrate my code with a web crawler. Inside the readCSV function I have to run the crawler for every site that I read, and for each one I should call the check_page function.

Now I've edited readCSV in this way:

const fileStream = fs.createReadStream('./topm.csv');

  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

for await (const line of rl) {
    var currentline=line.split(",");

    await (new Promise( resolve => {
      new Crawler().configure({depth: 2})
      .crawl(site, async (page) => {
          //console.log(page.url);
          var res_server_http = await check_page("http://www."+currentline[1])

          // Resolve here
          resolve();
      });
    }));
  
  }

I'm using this package for the web crawler: https://www.npmjs.com/package/js-crawler

This function now doesn't work because it is not async. How can I change my code?

Now I get this error:

(node:907) UnhandledPromiseRejectionWarning: ReferenceError: site is not defined
at /Users/francesco/Desktop/tesi/crawler.js:55:14
at new Promise (<anonymous>)
at readCSV (/Users/francesco/Desktop/tesi/crawler.js:53:12)
at processTicksAndRejections (internal/process/task_queues.js:97:5)

(node:907) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)
(node:907) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Write a function that wraps the crawler part into a promise so that you can use it with `async/await`. See [How do I convert an existing callback API to promises?](https://stackoverflow.com/q/22519784/218196) — Felix Kling, Feb 09 '21 at 09:50
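As a rough illustration of that suggestion, here is a minimal sketch of wrapping a callback-style call in a promise. The `visit` function is a hypothetical stand-in for a callback API such as the crawler's, and the body of `check_page` is invented because the real one is not shown in the question. Note that the callback is declared `async`, which is what makes `await` legal inside it.

```javascript
// Stand-in for the question's check_page; the real body is not shown in the thread.
async function check_page(url) {
  return `checked ${url}`;
}

// Hypothetical callback-style API, standing in for crawler.crawl(url, onSuccess).
function visit(url, onSuccess) {
  setTimeout(() => onSuccess({ url }), 0);
}

// Without `async` on the callback, using `await` inside it is a SyntaxError:
//   visit(url, (page) => { await check_page(page.url); });

// Wrap the callback API in a promise so callers can simply `await` it.
function visitAndCheck(url) {
  return new Promise((resolve, reject) => {
    visit(url, async (page) => {
      try {
        // Settle the outer promise with whatever check_page produces.
        resolve(await check_page(page.url));
      } catch (err) {
        // Forward failures instead of leaving the promise pending forever.
        reject(err);
      }
    });
  });
}
```

Inside an async function this can then be used as `const result = await visitAndCheck("http://www." + currentline[1]);`.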

emi · Answer 1 · 2021-02-09T10:12:14.307

0

Add a Promise:

  for await (const line of rl) {
    var currentline=line.split(",");

    await (new Promise( resolve => {
      // Note: `site` is not declared in this snippet; it must hold the URL to crawl
      new Crawler().configure({depth: 2})
      .crawl(site, async (page) => {
          //console.log(page.url);
          var res_server_http = await check_page("http://www."+currentline[1])

          // Resolve here
          resolve();
      });
    }));
  }
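One thing to watch in the snippet above: `resolve()` runs as soon as the first page callback finishes, so the loop moves on to the next CSV line while the crawler may still be visiting other pages at depth 2. A possible variant, sketched under assumptions, waits for the whole crawl instead. It assumes the crawler accepts an options object with `success`, `failure` and `finished` callbacks, as the js-crawler README appears to describe; `crawlSite` and its parameters are hypothetical names, and the crawler is passed in so the helper does not depend on a particular version.

```javascript
// Hypothetical helper. `crawler` is assumed to expose
// crawl({ url, success, failure, finished }), the options form shown in the
// js-crawler README; check this against the version you have installed.
function crawlSite(crawler, url, onPage) {
  return new Promise((resolve, reject) => {
    const pending = []; // one outcome promise per crawled page

    crawler.crawl({
      url,
      success: (page) => {
        // Run the (possibly async) handler and record its outcome right away,
        // so a rejection is never left without a handler.
        const outcome = Promise.resolve()
          .then(() => onPage(page))
          .then(() => ({ ok: true }), (error) => ({ ok: false, error }));
        pending.push(outcome);
      },
      failure: (page) => {
        // A page that could not be fetched is logged and skipped.
        console.error(`Could not fetch ${page.url}`);
      },
      finished: () => {
        // The crawl is over: settle once every page handler has settled.
        Promise.all(pending).then((outcomes) => {
          const failed = outcomes.find((o) => !o.ok);
          if (failed) reject(failed.error);
          else resolve(outcomes.length);
        });
      },
    });
  });
}
```

In the loop this might be called as `await crawlSite(new Crawler().configure({depth: 2}), "http://www." + currentline[1], (page) => check_page(page.url));`, which resolves with the number of pages handled.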

edited Feb 09 '21 at 10:12

answered Feb 09 '21 at 10:01

I have the same error: SyntaxError: await is only valid in async function – Fra96 Feb 09 '21 at 10:05
Sorry, just fixed – emi Feb 09 '21 at 10:05
`res_server_http` is not used. How are you going to use it? – emi Feb 09 '21 at 10:07
We didn't know that was your error. This is because `await` keyword can only be used inside a function declared with `async`. In your original code, the callback used for `crawl` does not have the `async` keyword, hence, it cannot `await` the result from `check_page`. – emi Feb 09 '21 at 10:10
I'm not using it for the moment – Fra96 Feb 09 '21 at 10:12
Fixed the code taking care of this error. – emi Feb 09 '21 at 10:12
I tried it, but now I get the error that I added at the end of my question. – Fra96 Feb 09 '21 at 10:21
You are not catching errors from `crawl` and `check_page`. You need to wrap them inside a `try {} catch(err) {}` block and handle it correctly (at least, log errors to console). – emi Feb 09 '21 at 10:50
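A minimal sketch of the error handling suggested in the last comment, assuming the URL to crawl is meant to come from the second CSV column (which would also give `site` a definition and so avoid the ReferenceError). `processSites` and `crawlOne` are hypothetical names: `lines` can be any iterable or async iterable of CSV lines, such as the `rl` interface from the question, and `crawlOne(site)` is whatever promise-returning function does the crawl plus `check_page` work.

```javascript
// Hypothetical driver loop: handles each site in turn and keeps going on errors.
async function processSites(lines, crawlOne) {
  const failures = [];

  for await (const line of lines) {
    const domain = line.split(",")[1];
    if (!domain) continue; // skip blank or malformed lines

    // Define the URL explicitly before using it.
    const site = "http://www." + domain.trim();

    try {
      // Wait for this site to finish before reading the next line.
      await crawlOne(site);
    } catch (err) {
      // At least log the error; one bad site should not stop the run.
      console.error(`Skipping ${site}: ${err.message}`);
      failures.push({ site, error: err });
    }
  }

  return failures;
}
```

Calling `await processSites(rl, mySiteHandler)` from inside `readCSV` would then return the list of sites that failed, rather than producing an unhandled promise rejection.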
