I have a vector of URLs from which I need to extract some text.

I use rvest and this code:

r <- getURL(queries[2])

pages_data <- read_html(r) %>% 
  html_nodes(".bloko-button.HH-Pager-Control") %>%
  html_text()

In this case I get:

character(0)

But if I pass the same URL as a literal character string instead of a vector element, it works:

url <- "https://kazan.hh.ru/search/vacancy?L_is_autosearch=false&area=2&clusters=true&enable_snippets=true&no_magic=true&only_with_salary=true&search_field=name&text=продавец-консультант"
r <- getURL(url)

pages_data <- read_html(r) %>% 
  html_nodes(".bloko-button.HH-Pager-Control") %>%
  html_text()
[1] "2"      "3"      "4"      "5"      "74"     "дальше"

But queries[2] == url is TRUE. What's the problem?

Code that builds queries:

start_url <- "https://kazan.hh.ru/search/vacancy?L_is_autosearch=false&area=2&clusters=true&enable_snippets=true&no_magic=true&only_with_salary=true&search_field=name"
professions <- c("frontend", "продавец-консультант", "менеджер+по+персоналу", "слесарь")

queries <- str_c(start_url, "&text=", professions)
Halva

1 Answer

You need to wrap the queries in URLencode(), which percent-encodes characters that are not allowed in a URL (see ?URLencode).

library(RCurl)  # getURL()
library(rvest)  # read_html(), html_nodes(), html_text(), %>%
r <- getURL(URLencode(queries[2]))

pages_data <- read_html(r) %>% 
  html_nodes(".bloko-button.HH-Pager-Control") %>%
  html_text()

pages_data

By the way, the first query succeeded and the second didn't because the first contains no Cyrillic characters. Running URLencode() on every URL is a safe practice.
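To see what URLencode() does to each query, here is a minimal sketch that needs no network access. It assumes a UTF-8 session, shortens the base URL for readability, and uses paste0() in place of stringr::str_c(); none of that comes from the original post.

```r
# Shortened base URL (an assumption for illustration; the full URL is in the question)
start_url <- "https://kazan.hh.ru/search/vacancy?area=2&search_field=name"
professions <- c("frontend", "продавец-консультант", "менеджер+по+персоналу", "слесарь")

# paste0() stands in for stringr::str_c() so the sketch needs only base R
queries <- paste0(start_url, "&text=", professions)

# Encode every query; vapply() keeps this working even where URLencode()
# is not vectorised
encoded <- vapply(queries, URLencode, character(1), USE.NAMES = FALSE)

# The ASCII-only query passes through unchanged
identical(encoded[1], queries[1])

# The Cyrillic queries are rewritten as percent-encoded bytes
print(encoded[2])
```

With the default repeated = FALSE, URLencode() leaves a URL that already looks percent-encoded untouched, so applying it to every element should not double-encode anything.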

ravic_
  • It worked, thank you! But I didn't quite understand the note about characters. Don't both contain Cyrillic characters? – Halva Nov 18 '19 at 18:59
  • Glad it worked. If you inspect the `queries` object in your environment, you'll find that `queries[1]` doesn't contain any Cyrillic characters (it ends in `text=frontend`), but all the others do. – ravic_ Nov 18 '19 at 19:04
  • Yes. But in my first example, the `url` object contains Cyrillic characters and `read_html` works without `URLencode`. In the second example, the vector element `queries[2]` also contains Cyrillic characters, yet `read_html` doesn't work, and that is where your tip helped. – Halva Nov 18 '19 at 19:22
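The thread leaves that last point open. One possible way to investigate, offered only as a sketch: in R, `==` can report TRUE for two strings whose underlying bytes differ, because strings with different marked encodings are translated to a common encoding before comparison. The helper `compare_strings()` below is hypothetical (not from the thread), and whether an encoding difference is what actually happened in the asker's session is an assumption.

```r
# Hypothetical helper: report how two strings compare, how they are marked,
# and whether their raw bytes match
compare_strings <- function(a, b) {
  list(
    equal      = a == b,                               # what `==` reports
    encodings  = c(a = Encoding(a), b = Encoding(b)),  # declared encoding marks
    same_bytes = identical(charToRaw(a), charToRaw(b)) # byte-level comparison
  )
}

# Illustration with one character stored in two encodings
a <- "\u00e9"                                 # marked as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")  # same character, latin1 bytes
res <- compare_strings(a, b)
print(res)
```

In the asker's session the equivalent check would be `compare_strings(url, queries[2])`. If the bytes differ, getURL() would be sending different requests for the two strings even though they compare as equal.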