
I'm trying to scrape contact info from a website that lists relevant organizations I wish to contact. However, the info I need is repeatedly placed under the same class along with lots of irrelevant info.

My initial idea was to grab each piece of info separately (name, phone, email, website, etc.) and create a table (because that's how the tutorials do it with their perfect example sites).

Unfortunately, everything other than the company name uses the same class (.summaryRecordType). Grabbing everything under that class isn't too bad, as the only part I don't want is "area served". However, I'm not sure how to separate each piece of info and anchor it to the proper company name in a table.

My guess now is that I need to use the wider class (.summaryTitlePrivatePractice) to get company names and contact info while keeping them linked to make a table. But that makes everything into one solid paragraph of text.

I'd like to get this all into a table that has separate columns for the company names, addresses, phone, email, and website. I don't need any other info, but if it's easier to leave it in under its own column, that's fine; I just won't use it.

I'm brand new to this and not sure where to go from here. If this would be easier in Python, feel free to give a solution in that language. I'm only using R because I am mildly familiar with it for data visualization. Code I've tried below:

#Loading relevant packages
library(xml2)
library(rvest)
#library(stringr)
#library(dbplyr)

#Website
ementalhealth <- 'https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229'

#Getting the Company Names
CompName <- ementalhealth

CleanCompName <- CompName %>%
  read_html() %>% 
  html_nodes("div.emhTip") %>% 
  html_text()

#Cleaning the Company Names
CleanCompName <- gsub("\n","", CleanCompName)
CleanCompName <- gsub("\t","", CleanCompName)
head(CleanCompName)

#Getting the Contact Data
CompSum <- ementalhealth

CleanCompSum <- CompSum %>%
  read_html() %>% 
  html_nodes(".summaryRecordType") %>%
  html_text()

#Cleaning the Contact Data
CleanCompSum <- gsub("\n","", CleanCompSum)
CleanCompSum <- gsub("\t","", CleanCompSum)
head(CleanCompSum, 50)
#Issue where some companies don't have all the info, or have extra info (Area Served)... and I'm not sure how to link them in a table


#What if I grab everything altogether?
CompCombined <- ementalhealth

CleanCompCombined <- CompCombined %>%
  read_html() %>% 
  html_nodes(".summaryTitlePrivatePractice") %>% 
  html_text()

#Cleaning the Contact Data
CleanCompCombined <- gsub("\n","", CleanCompCombined)
CleanCompCombined <- gsub("\t","", CleanCompCombined)
head(CleanCompCombined, 50)
#Now everything is one big paragraph


2 Answers


How difficult this is depends on what info you want from the page. I am working on the assumption that you want a dataframe/tibble with details from PointOfCare (e.g. 1. Hospitals), through ServiceType (e.g. Publicly Funded / Free Services), all the way down to the actual listing details of each service.

There are two immediate problems to overcome if going for all the above info:

  1. The DOM is pretty flat, i.e. the PointOfCare info sits at the same level of the DOM as ServiceType, and the start of the service listings is only one level deeper. This means there is no nice logical way to use an HTML parser, select parent nodes, then process children, and still get the desired PointOfCare and ServiceType info mapped to each service listing.
  2. There are differing numbers of child nodes holding a given service's info (those with className summaryRecordType) within each listing, ranging between 3 and 5.

① To deal with the first problem I decided to convert the retrieved HTML to a string and split that string into chunks to process. I retrieve the PointOfCare labels and use those to generate the initial blocks, settings_blocks:

all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)

At this point I have e.g. 1. Hospitals in the first block, 2. Inpatient services in the second block and so on.

I then further split each of those chunks by the ServiceTypes:

service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") 

Annoyingly, I had to hardcode the \r\n in the latter string, rather than retrieve it from the relevant node's HTML, as it was not present in the parsed text (and therefore no match was found for the split).

So 1. Hospitals, when processed, would have only a sub-chunk for Publicly Funded / Free Services, whereas 2. Inpatient services would end up split in two: Publicly Funded / Free Services and Private Practice Professionals and Commercial Businesses. This all happens in a loop over settings_blocks:

for (i in seq_along(settings_blocks)) {
  r <- r + 1
  point_of_care <- points_of_delivery[[i]]
  splits <- split_points(settings_blocks[[i]])
  nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))
  # etc.........

There are a couple of sections with no listings, e.g. 3.3 Drop-in centres; in those cases I generate a record as follows:

record <- list(
      point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
      NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
A full record has the following info fields:

list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages)

Once at the lowest-level block, nodes_html[[j]], provided there are tables (each table is a listing), I retrieve the info for all the fields of interest:

records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))

② Now, we still have the issue of differing amounts of info in each listing table. However, it turns out one can map what info is present according to how many child nodes with className summaryRecordType are present. The mapping is as follows:

| Child nodes count | BusinessName | ServiceDescription | Address | Tel | Website | AreaServed |
|-------------------|--------------|--------------------|---------|-----|---------|------------|
| 3                 |              |                    | 1       | 2   |         | 3          |
| 4                 | 1            |                    | 2       | 3   |         | 4          |
| 5                 |              | 1                  | 2       | 3   | 4       | 5          |

From column 2 onwards, the number indicates which node holds the info named in the column header. As I loop over each bottom-level chunk I have a helper function that applies this mapping when retrieving the listing info:

record_from_table <- function(table, point_of_care, service_type) {
  info_lines <- table %>% html_nodes(".summaryRecordType")
  Title <- table %>%
    html_node("a") %>%
    html_text() %>%
    trimws()
  Url <- table %>%
    html_node("a") %>%
    html_attr("href") %>%
    url_absolute(link)
  if (length(info_lines) == 3) {
    BusinessName <- NA_character_
    ServiceDescription <- NA_character_
    Address <- info_lines[1] %>%
      html_text() %>%
      trimws()  # etc.........
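
As an aside, the same mapping could be expressed as data rather than if/else branches. A minimal sketch (a hypothetical field_map/extract_fields pair, not part of the actual solution below; assumes rvest is loaded):

# which .summaryRecordType node holds each field, keyed by the node count
field_map <- list(
  `3` = c(Address = 1, Tel = 2, AreaServed = 3),
  `4` = c(BusinessName = 1, Address = 2, Tel = 3, AreaServed = 4),
  `5` = c(ServiceDescription = 1, Address = 2, Tel = 3, Website = 4, AreaServed = 5)
)

extract_fields <- function(info_lines) {
  idx <- field_map[[as.character(length(info_lines))]]
  # named character vector of the fields present; absent fields simply aren't returned
  sapply(idx, function(i) trimws(html_text(info_lines[i])))
}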

I pass in PointOfCare and ServiceType so they are mapped to the record level. By the end of

for (i in seq_along(settings_blocks)) {...}

I have a list of records/listings. I then do some tidying of the records. I return tibbles, so I can later use map_dfr to generate my final dataframe structure:

records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)

With the final dataframe structure in place and populated I set about tidying up some other things I noticed:


① During my final_blocks function, the encoding of UTF-8 input strings was getting garbled.

For example, the following correctly UTF-8 encoded string (on Windows OS):

Autisme-Asperger-Québec (AAQc)

Ended up as:

Autisme-Asperger-QuÃ©bec (AAQc)

A colleague pointed out that it was actually tidy_html() at fault, and that this was particular to Windows OS (it ran fine on Linux) due to the default encoding on Windows. The mangling is known as Mojibake. He pointed me to a couple of links on the subject for further reading.

To quote only a small part of one of them:

The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx where the scalar representation of the character is: 0000000000yyyyyxxxxxx

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. “é” (233 in decimal) is therefore between 127 and 2047, so it will be coded on 2 bytes. Therefore its UTF-8 representation is 11000011 10101001.

Now let’s imagine that this “é” sits in a document that’s believed to be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters are coded on 8 bits, so the 2-byte character “é” will become 2 1-byte-long latin-1 characters. The first character is 11000011, i.e. C3, which, when checking the table, corresponds to “Ã” (U+00C3); the second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

The colleague pointed out that I could fix this by converting from UTF-8 to latin1 twice, because the UTF-8 bytes had been encoded as UTF-8 again.

iconv(iconv(<mangled_string>, from = "UTF-8", to = "latin1"), "UTF-8", "latin1")
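
For a concrete illustration, here is a minimal sketch of a single round of that mangling and its reversal (hypothetical strings; assumes an R session in a UTF-8 locale, which is exactly what older R on Windows lacked):

charToRaw(enc2utf8("\u00e9"))                              # c3 a9: the two-byte UTF-8 encoding of "é" described above
mangled <- "Autisme-Asperger-Qu\u00c3\u00a9bec (AAQc)"     # the "QuÃ©bec" Mojibake
recovered <- iconv(mangled, from = "UTF-8", to = "latin1") # back to the raw bytes c3 a9
Encoding(recovered) <- "UTF-8"                             # re-declare those bytes as UTF-8
recovered                                                  # "Autisme-Asperger-Québec (AAQc)"

Note this sketch undoes one round of mis-decoding; the nested iconv above is for strings that went through the cycle twice.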

I had introduced tidy_html() to ensure the sliced text ended up being parsable.


② I chose not to try to fix the mangled strings as per the description above. Instead, as my final dataframe provided the skeleton for where all my data resided, I simply went back to the original HTML document, parsed out the info again (in UTF-8), and mapped it onto my dataframe. This had the added benefit of preserving spacing between certain words and line breaks.

titles <- page %>%
  html_nodes(".emhTip a:nth-of-type(1)") %>%
  html_text()
descriptions <- page %>%
  html_nodes(".emhTip + .summaryRecordType") %>%
  html_text() %>%
  trimws()
mixed_nodes <- page %>%
  html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
  html_text() %>%
  trimws()

r <- r1 <- 0

# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
  if (!is.na(listings$Title[i])) {
    r <- r + 1
    listings$Title[i] <- titles[r]
    if (!is.na(listings$BusinessName[i])) {
      listings$BusinessName[i] <- mixed_nodes[r]
    }
  }
  if (!is.na(listings$ServiceDescription[i])) {
    r1 <- r1 + 1
    listings$ServiceDescription[i] <- descriptions[r1]
  }
}

Last, but not least, I noticed that some service descriptions had a ...more in the listing, meaning an additional XHR request would be required to gather the full description. In case you want the full descriptions, I provide a helper function that retrieves these:

expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()

listings$ServiceDescription <- expanded_descriptions

Now, that did slow the run-time, as I needed to add some delays to ensure connections were opened and closed properly.


The full code is below, including a couple of attributions where I borrowed a few lines from other SO contributors.


R:

library(stringr)
library(rvest)
library(htmltidy)
library(tidyverse)

point_of_delivery <- function(node) {
  pod <- node %>%
    html_node(".classyHeading") %>%
    html_text() %>%
    str_split("\n") %>%
    unlist() %>%
    tail(1) %>%
    trimws() %>%
    str_replace("\xa0", " ")
  return(pod)
}


delivery_matches <- function(node) {
  dm <- node %>%
    html_node(".classyHeading") %>%
    html_text() %>%
    str_split("\n") %>%
    unlist() %>%
    tail(1)
  return(dm)
}


get_blocks <- function(a_list) {
  results <- vector("list", length(a_list))
  for (i in seq_along(a_list)) {
    start_pos <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i])))[, 1]
    if (i == length(a_list)) {
      block <- substring(all_text, start_pos, nchar(all_text)) %>% tidy_html()
    } else {
      next_start <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i + 1])))[, 1]
      block <- substring(all_text, start_pos, next_start) %>% tidy_html()
    }
    results[[i]] <- block
  }
  return(results)
}


split_points <- function(node) {
  res <- map(service_types, ~ str_locate_all(node %>% toString(), .)) %>% unlist()
  if (length(res) == 0) {
    return(c(NA_integer_))
  } else {
    return(res[seq(1, length(res), 2)]) # https://stackoverflow.com/a/34100009/6241235 @stas g
  }
}


final_blocks <- function(splits, block) {
  results <- vector("list", length(splits))
  if (length(splits) == 1) {
    # single (or no) split point: return the whole block (or NA) directly
    res <- ifelse(is.na(splits), splits, block %>% tidy_html())
    return(res)
  } else {
    for (i in seq_along(splits)) {
      start_pos <- splits[i]
      if (i == length(splits)) {
        res <- substring(block, start_pos, nchar(block)) %>% tidy_html()
      } else {
        next_start <- splits[i + 1]
        res <- substring(block, start_pos, next_start) %>% tidy_html()
      }
      results[[i]] <- res
    }
    return(results)
  }
}


record_from_table <- function(table, point_of_care, service_type) {
  info_lines <- table %>% html_nodes(".summaryRecordType")
  Title <- table %>%
    html_node("a") %>%
    html_text() %>%
    trimws()
  Url <- table %>%
    html_node("a") %>%
    html_attr("href") %>%
    url_absolute(link)
  if (length(info_lines) == 3) {
    BusinessName <- NA_character_
    ServiceDescription <- NA_character_
    Address <- info_lines[1] %>%
      html_text() %>%
      trimws()
    Tel <- info_lines[2] %>%
      html_text() %>%
      trimws()
    Website <- NA_character_
    AreaServed <- info_lines[3] %>%
      html_text() %>%
      trimws()
  } else if (length(info_lines) == 4) {
    BusinessName <- info_lines[1] %>%
      html_text() %>%
      trimws()
    ServiceDescription <- NA_character_
    Address <- info_lines[2] %>%
      html_text() %>%
      trimws()
    Tel <- info_lines[3] %>%
      html_text() %>%
      trimws()
    Website <- NA_character_
    AreaServed <- info_lines[4] %>%
      html_text() %>%
      trimws()
  } else {
    BusinessName <- NA_character_
    ServiceDescription <- info_lines[1] %>%
      html_text() %>%
      trimws()
    Address <- info_lines[2] %>%
      html_text() %>%
      trimws()
    Tel <- info_lines[3] %>%
      html_text() %>%
      trimws()
    Website <- info_lines[4] %>%
      html_text() %>%
      trimws()
    AreaServed <- info_lines[5] %>%
      html_text() %>%
      trimws()
  }
  Ages <- get_age(table)
  return(list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages))
}

get_age <- function(table) {
  tryCatch(
    table %>%
      html_node(".summaryTitlePrivatePractice + td") %>%
      html_text() %>%
      str_replace("Add to Info Cart", "") %>%
      trimws(),
    error = function(e) {
      return(NA_character_)
    }
  )
}


clean_record <- function(a_record) {
  a_record[[7]] <- str_replace(a_record[[7]], " Map", "")
  a_record[[10]] <- str_replace(a_record[[10]], "Area[s]? Served: ", "")
  a_record <- set_names(a_record, c("PointOfCare", "ServiceType", "Title", "Url", "BusinessName", "ServiceDescription", "Address", "Tel", "Website", "AreaServed", "Ages"))
  return(a_record %>% as_tibble())
}


full_description <- function(current_description, current_url) {
  if (grepl(" \\.\\.\\.", current_description)) {
    content <- read_html(current_url, encoding = "UTF-8") %>%
      html_node(".recordSummary") %>%
      html_text() %>%
      trimws()
    CatchupPause(.1)
  } else {
    content <- gsub("\\s+more", "", current_description) %>% trimws()
  }
  return(content)
}


CatchupPause <- function(Secs) { # https://stackoverflow.com/a/52758758 @nm200
  Sys.sleep(Secs) # pause to let connection work
  closeAllConnections()
  gc()
}


link <- "https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229&recordType=1&sortBy=0"
page <- read_html(link, encoding = "UTF-8")
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") # annoyingly, have to hardcode: \r\n is not present in the parsed node output

records <- vector("list", 1000) # > max expected num entries when lists unnested
r <- 0

# Generate all records for the final tibble
for (i in seq_along(settings_blocks)) {
  r <- r + 1
  point_of_care <- points_of_delivery[[i]]
  splits <- split_points(settings_blocks[[i]])
  nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))

  if (is.na(nodes_html)[1]) {
    record <- list(
      point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
      NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
    )
    records[[r]] <- list(record)
  } else {
    for (j in seq_along(nodes_html)) {
      service_type <- if_else(str_detect(nodes_html[[j]], service_types[1]), service_types[1], service_types[2])
      tables <- nodes_html[[j]] %>%
        read_html() %>%
        html_nodes(".condensedViewTable")
      records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
      r <- r + 1
    }
  }
}

records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
#

## Partly due to default Windows encoding (and lack of UTF-8 support in older R) causing Mojibake via the earlier tidy_html(), we grab the properly
## encoded info to overwrite the mangled text / text lacking spaces
titles <- page %>%
  html_nodes(".emhTip a:nth-of-type(1)") %>%
  html_text()
descriptions <- page %>%
  html_nodes(".emhTip + .summaryRecordType") %>%
  html_text() %>%
  trimws()
mixed_nodes <- page %>%
  html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
  html_text() %>%
  trimws()

r <- r1 <- 0

# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
  if (!is.na(listings$Title[i])) {
    r <- r + 1
    listings$Title[i] <- titles[r]
    if (!is.na(listings$BusinessName[i])) {
      listings$BusinessName[i] <- mixed_nodes[r]
    }
  }
  if (!is.na(listings$ServiceDescription[i])) {
    r1 <- r1 + 1
    listings$ServiceDescription[i] <- descriptions[r1]
  }
}


# descriptions_to_expand <- dplyr::filter(listings, grepl(" \\.\\.\\.", ServiceDescription))
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
write.csv(listings, "~/data.csv", na = "")

Some example rows of output:

(screenshot of example output rows omitted)

    This is incredible, thanks! Ngl though, I ended up throwing everything into a table half-baked and manually edited in Excel to get what I needed a couple of days ago. – speens Jan 28 '21 at 18:38

Although QHarr's answer is flawless, I thought I would follow up with what I actually did (before getting their answer) for anyone viewing this question in the future who, like me, wouldn't have the knowledge to modify QHarr's response to fit their own situation.

Because I was having issues grabbing the relevant data separately from the site, I just grabbed everything together from .summaryTitlePrivatePractice and used the \n, \r, and \t characters that came with it as markers to split the data. Everything in the included code (below) should be self-explanatory, but I'll explain the cleaning section. There were random strings of \t, \r, and \n in between each piece I needed, and by checking the output at that stage I could see that \n never repeated more than 4 times. So I converted every run into a single \n and used that as my marker for splitting the data.

This solution wasn't ideal, as not every entry (row) had the same amount of data, so I had to manually go through and move data around to align the columns. That was OK here because most rows had the same number of columns and there were only 150 rows, but it wouldn't work for larger data sets. I also used some functions in Excel to make bulk edits where applicable, which sped up the process. R code below:

library(xml2)
library(rvest)
library(tidyverse)

#Website
ementalhealth <- 'https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229'

#What if I grab everything altogether?
CompCombined <- ementalhealth

CleanCompCombined <- CompCombined %>%
  read_html() %>% 
  html_nodes(".summaryTitlePrivatePractice") %>% 
  html_text()

#Cleaning the Everything Data
CleanCompCombined <- gsub("\t","", CleanCompCombined)
CleanCompCombined <- gsub("\r","", CleanCompCombined)
CleanCompCombined <- gsub("\n\n\n\n","\n", CleanCompCombined)
CleanCompCombined <- gsub("\n\n\n","\n", CleanCompCombined)
CleanCompCombined <- gsub("\n\n","\n", CleanCompCombined)
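#(the three gsub calls above could equally be one regex that collapses any
# run of newlines: CleanCompCombined <- gsub("\n+", "\n", CleanCompCombined))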
#Now each entry is a single newline-delimited string

#removing duplicates
dedupedsplit <- unique(CleanCompCombined)

#splits (split each entry once, then pull the fields out by position)
splitlist <- strsplit(as.character(dedupedsplit), '\n')
col1 <- sapply(splitlist, "[", 1)
col2 <- sapply(splitlist, "[", 2)
col3 <- sapply(splitlist, "[", 3)
col4 <- sapply(splitlist, "[", 4)
col5 <- sapply(splitlist, "[", 5)
col6 <- sapply(splitlist, "[", 6)
col7 <- sapply(splitlist, "[", 7)
col8 <- sapply(splitlist, "[", 8)
col9 <- sapply(splitlist, "[", 9)
col10 <- sapply(splitlist, "[", 10)

#tablecraft
blahtable <- data.frame(col1 = col1,
                        col2 = col2,
                        col3 = col3,
                        col4 = col4,
                        col5 = col5,
                        col6 = col6,
                        col7 = col7,
                        col8 = col8,
                        col9 = col9,
                        col10 = col10)

#output
blahtable