I'm trying to scrape all the concerts in the boxed-off area in the picture below:

https://i.stack.imgur.com/7QIMM.jpg

The problem is that the list only shows the first 10 results until you scroll to the bottom of that specific div; it then dynamically loads more, and keeps doing so until there are no more results. I tried following the answer to the question linked below, but couldn't get the div to scroll far enough to load all the 'concerts':

How to scroll inside a div with Puppeteer?

Here's my basic code:

const browser = await puppeteerExtra.launch({ args: ['--no-sandbox'] });

async function functionName() {
    const page = await browser.newPage();
    await preparePageForTests(page);
    page.once('load', () => console.log('Page loaded!'));
    await page.goto(`https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail`);

    const resultList = await page.waitForSelector(".odIJnf");
    const scrollableSection = await page.waitForSelector("#Q5Vznb"); // I think this is the div that contains all the concert items.
    const results = await page.$$(".odIJnf"); // This needs to be iterable to be used in the for loop.

    // This is where I'd like to scroll the div all the way to the bottom.

    for (let i = 0; i < results.length; i++) {
      const result = await (await results[i].getProperty('innerText')).jsonValue();
      console.log(result)
    }
}
nickcoding2

2 Answers

3

Try this to scroll down on the list of concerts. You can keep looping until the number of results stops increasing, or you find the concert you are looking for:

await page.evaluate(()=>{
  document.querySelector("#Q5Vznb").scrollIntoView(false);
});
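
To make the "keep looping until the number of results stops increasing" part concrete, here is a minimal sketch of such a loop. It assumes the selectors from the question (#Q5Vznb and .odIJnf) are still valid on the page; scrollUntilStable, maxRounds, and settleMs are names made up for illustration, and the one-second pause is a guess rather than a measured value.

```javascript
// Hypothetical helper: scroll the container into view repeatedly until the
// number of matching items stops growing (or maxRounds is reached).
async function scrollUntilStable(page, options = {}) {
  const {
    itemSelector = '.odIJnf',       // selector taken from the question
    containerSelector = '#Q5Vznb',  // selector taken from the question
    maxRounds = 50,                 // assumed safety cap
    settleMs = 1000,                // assumed pause for new items to load
  } = options;

  // Counts the matching elements inside the page.
  const countItems = () => page.$$eval(itemSelector, (els) => els.length);

  let previous = await countItems();
  for (let round = 0; round < maxRounds; round++) {
    // Runs in the browser: bring the bottom of the container into view.
    await page.evaluate((selector) => {
      const container = document.querySelector(selector);
      if (container) container.scrollIntoView(false);
    }, containerSelector);

    // Give the page a moment to append more results.
    await new Promise((resolve) => setTimeout(resolve, settleMs));

    const current = await countItems();
    if (current <= previous) break; // nothing new appeared, so stop
    previous = current;
  }
  return previous;
}
```

In the question's code this would be called as await scrollUntilStable(page) just before page.$$(".odIJnf"), so that the handles are collected only after the list has stopped growing.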
Benny
  • Hi Benny, I think the div might be wrong. So far, I've tried #Q5Vznb, .MZpzq, and .uAAqtb--none have worked so far at getting more than the original amount of loaded .odIJnf elements. Do you have any suggestions for other divs I should try? Thank you! – nickcoding2 May 11 '21 at 19:05
  • I also tried using '#immersive_desktop_root > div.drPJve > div.YbRs3e > div:nth-child(2) > div.UbEfxe.uAAqtb > div.MZpzq.gws-horizon-textlists__tl-no-filters.TWKvJb' and '#immersive_desktop_root > div.drPJve > div.YbRs3e > div:nth-child(2) > div.UbEfxe.uAAqtb' as the arguments for querySelector(). Neither worked unfortunately. – nickcoding2 May 11 '21 at 19:39
  • I think it is right. I just opened that page in Google Chrome and ran this in the console (right click, then click on Inspect; or use the shortcut Ctrl+Shift+I): document.querySelectorAll('.odIJnf').length returned 20; then document.querySelector("#Q5Vznb").scrollIntoView(false); returned undefined; then document.querySelectorAll('.odIJnf').length returned 30. So the number of concerts increased by 10 after the scroll command. – Benny May 11 '21 at 21:11
  • does it make a difference that I'm using await page.$$(".odIJnf") instead of document.querySelectorAll ? – nickcoding2 May 11 '21 at 21:15
  • await page.$$ is executed in Node.js and returns handles to the elements. When you call page.evaluate(), the function you pass is executed inside the browser page (as if you had typed it into the console), so there you can call document.querySelector directly. This is a good explanation of the difference: https://stackoverflow.com/questions/55664420/page-evaluate-vs-puppeteer-methods – Benny May 11 '21 at 21:18
  • I would've thought they both return an array of elements. page.$$(".odIJnf") is iterable and I end up accessing it in a for loop with await (await results[i].getProperty('innerText')).jsonValue(); -- for some reason, the document.querySelectorAll() version isn't iterable. I was reading that this is because it returns an object. Is there any way I can have it return the same thing that the $$ function returns? – nickcoding2 May 12 '21 at 00:48
  • What do you think? – nickcoding2 May 18 '21 at 16:35
  • I think you need to update your code above or ask as a new question. Not sure where you are now. – Benny May 18 '21 at 18:23
  • I updated my code with how I use the "results." Essentially, if I do document.querySelectorAll() to select the elements, it doesn't return an iterable array in the same way that page.$$ does... – nickcoding2 May 19 '21 at 00:39
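
On the iterable question raised in these comments: page.evaluate() can only hand back values that survive serialization, so the DOM nodes from document.querySelectorAll() do not arrive in Node.js as usable elements. One possible workaround, assuming you only need the text of each item, is to convert the NodeList to an array of strings inside the page and return that. This is only a sketch; getInnerTexts is a hypothetical helper name, not part of Puppeteer.

```javascript
// Hypothetical helper: returns the innerText of every element matching
// `selector` as a plain array of strings.
async function getInnerTexts(page, selector) {
  return page.evaluate((sel) => {
    // Runs in the browser: Array.from turns the NodeList into a real array
    // and maps each element to a serializable string.
    return Array.from(document.querySelectorAll(sel), (el) => el.innerText);
  }, selector);
}
```

Usage would look like const texts = await getInnerTexts(page, ".odIJnf"), after which texts is an ordinary array that can be used in a for loop without getProperty() or jsonValue().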
1

As you mention in your question, when you run page.$$, you get back an array of ElementHandles. From Puppeteer's documentation:

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.

This means you can iterate over them, but you also have to run evaluate() or $eval() over each element to access the DOM element.

I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page seems to use auto-generated classes and ids, which may change and would make your code brittle or stop it from working. It would be better to target the ul, li, and div elements directly.

I've created this snippet, which keeps scrolling until it has collected at least ITEMS concerts from the site:

const puppeteer = require('puppeteer')

/**
 * Constants
 */
const ITEMS = Number(process.env.ITEMS) || 50
const URL   = process.env.URL     || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then( ()    => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()
  
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)
 
  const results = await getResults(page)
  console.log(results)
  
  await browser.close()
}

async function getResults(page) {
  await page.waitForSelector("ul")
  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []
  
  const recurse = async () => {
    // Recurse exit clause
    if (ITEMS <= results.length) {
      return
    }

    const $lis = await page.$$("li")
    // Slicing this way will avoid duplicating the result. It also has
    // the benefit of not having to handle the refresh interval until
    // new concerts are available.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    await recurse()
  }
  // Start the recursive function
  await recurse()
 
  return results
}

By studying the page structure, we see that the ul for the list is nested three levels deep inside the div that handles the scroll. We also know that there are only two uls on the page, and the first is the one we want. These two lines rely on both facts:

  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]

When called on an ElementHandle, the $x function evaluates the XPath expression relative to that element as its context node. Here the context node is the ul, so "../../.." climbs three levels up the DOM tree to the div that we need. We then run a recursive function until we have collected the items that we want.
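
One caveat with the snippet above: the recursion only stops once ITEMS results have been collected, so it would keep going if the page has fewer concerts than that. Below is a rough iterative sketch that also gives up after a few scrolls that produce no new li elements. It additionally shows reaching the same ancestor with evaluateHandle instead of $x, in case your Puppeteer version does not provide $x. nthAncestor, collectItems, maxStalledRounds, and pauseMs are names and defaults invented for this example.

```javascript
// Hypothetical helper: climb `levels` parents up from an ElementHandle.
async function nthAncestor(handle, levels) {
  return handle.evaluateHandle((node, n) => {
    let current = node;
    for (let i = 0; i < n && current; i++) current = current.parentElement;
    return current;
  }, levels);
}

// Hypothetical helper: collect up to `limit` li texts, scrolling `container`
// to load more, and stopping early when nothing new shows up.
async function collectItems(page, container, options = {}) {
  const { limit = 50, maxStalledRounds = 3, pauseMs = 500 } = options;
  const results = [];
  let stalledRounds = 0;

  while (results.length < limit && stalledRounds < maxStalledRounds) {
    const handles = await page.$$('li');
    // Skip the items already collected and never read past `limit`.
    const fresh = handles.slice(results.length, limit);
    for (const li of fresh) {
      results.push(await li.evaluate((node) => node.innerText));
    }
    // Count consecutive rounds in which no new item appeared.
    stalledRounds = fresh.length === 0 ? stalledRounds + 1 : 0;

    // Scroll the container to its bottom, then pause so the page can load more.
    await container.evaluate((node) => node.scrollTo(0, node.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return results;
}
```

With the variables from the answer, this would be used roughly as const div = await nthAncestor(ul, 3) followed by const results = await collectItems(page, div, { limit: ITEMS }).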

guzmonne