
I am new to R and this is my first question. I apologize if it has been asked before, but I haven't found a solution.

Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:

library(rvest)

url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")

tables <- html_nodes(url, "table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
  html_table(fill = TRUE) %>% data.frame()

head(screener)

It was a bit difficult to find the table number, but I did. My question concerns screens with more than 20 results, like the one in my example: the site paginates them by appending &r=1, &r=21, &r=41, &r=61 to the end of each url.

How could I build the loop structure in this case?

i=0
for(z in ...){
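For reference, one way to fill in a loop skeleton like that is to build one URL per results page from the &r= offsets; a minimal sketch (the four offsets are taken from the question, not from the live page):

```r
# Build one URL per results page; each page starts 20 rows after the previous.
base_url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"

page_urls <- character(0)
for (z in c(1, 21, 41, 61)) {                     # the &r= offsets from the question
  page_urls <- c(page_urls, paste0(base_url, "&r=", z))
}
# Each URL can then be fed to read_html() and the table-extraction code above.
```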

Many thanks in advance for your help.

CarlosFC
  • Is `.[11]` consistent across all the links? I have checked it for &r=1 and &r=21; it's not present there. – rj-nirbhay May 09 '20 at 20:02
  • Yes, the link https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry&r=1 works. Maybe you cannot access it if you're not registered... – CarlosFC May 09 '20 at 20:19
  • I am talking about the `%>% .[11] %>%` step. The number 11 works fine for the current [url](https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry) but not for the &r=1 url – rj-nirbhay May 09 '20 at 20:22
  • I don't know why, but it works for me. I just checked again. – CarlosFC May 09 '20 at 20:30
  • The link is working fine; my concern is the `tables %>% html_nodes("table") %>% .[11]` step: its output is `screener {xml_nodeset (1)} [1] \n` – rj-nirbhay May 09 '20 at 20:34

1 Answer


Updated script, based on the new table number and link:

library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"

TableList <- c("1", "21", "41", "61") # page offsets

GetData <- function(URL, tableNo){

  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(URL, "&r=", tableNo), '\n')

  tables <- read_html(stringr::str_c(URL, "&r=", tableNo)) # get data from webpage for this page offset
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # all pages as a list of data frames
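A follow-up in the comments asks how to turn AllData into a single table; a minimal sketch using base R, with toy data frames standing in for the scraped pages:

```r
# AllData is a list of data frames with identical columns, one per page;
# do.call(rbind, ...) stacks them row-wise into a single data frame.
page1 <- data.frame(No = 1:2, Ticker = c("AIMC", "AME"))  # toy stand-ins
page2 <- data.frame(No = 3:4, Ticker = c("CFX", "CIR"))
AllData <- list(page1, page2)

AllDataCombined <- do.call(rbind, AllData)
nrow(AllDataCombined)  # one row per screener row across all pages
```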

Here is one approach using stringr and lapply:

library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url

TableList <- c("1", "21", "41", "61") # table number list

GetData <- function(URL, tableNo){

  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(URL, "&", tableNo), '\n')

  tables <- read_html(stringr::str_c(URL, "&", tableNo)) # get data from webpage based on table numbers

  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>% # check this index
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # list of data frames

However, please double-check the .[11] index, as it changes for these URLs (the ones with &1, &21, etc. appended). It works fine for the base URL, but for the &1, &21, etc. URLs there is no data at index 11. Please adjust accordingly.
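One way to avoid hard-coding a table index at all is to parse every table on the page and keep the one whose first row contains the screener's header; a sketch, assuming "Ticker" reliably appears in that header row:

```r
# Pick the screener table by content instead of a hard-coded index such as
# .[11] or .[17]: keep the parsed table whose first row contains "Ticker".
pick_screener <- function(tbls) {
  hit <- Filter(function(df) any(grepl("Ticker", unlist(df[1, ]))), tbls)
  if (length(hit) == 0) return(NULL)
  hit[[1]]
}

# Toy stand-ins for html_table() output, to show the selection logic:
tbls <- list(
  data.frame(X1 = "some navigation banner"),
  data.frame(X1 = c("No.", "1"), X2 = c("Ticker", "AIMC"))
)
screener <- pick_screener(tbls)

# Against the live page it would be used as:
# tbls <- read_html(url) %>% html_nodes("table") %>% html_table(fill = TRUE)
# screener <- pick_screener(tbls)
```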

rj-nirbhay
  • Thank you very much Nirbhay. Yes, that table number is not working with the function. I will try to find another way... – CarlosFC May 09 '20 at 22:12
  • The table numbers are not working because there is no data in table 11 for those URLs. I checked: table number `17` has the data, so just replace the 11 with 17 and it will work. In the next comment I have given the output of head(screener, 2). – rj-nirbhay May 10 '20 at 04:54
  • `head(screener,2) X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 1 No. Ticker Company Sector Industry Country Market Cap P/E Price Change Volume 2 1 AIMC Altra Industrial Motion Corp. Industrials Specialty Industrial Machinery USA 1.99B - 28.12 5.00% 454,656` – rj-nirbhay May 10 '20 at 04:56
  • Thanks again. I have tried with [17] but it shows only the second table to me. I don't know if I am doing something wrong. – CarlosFC May 10 '20 at 09:17
  • I checked again and it is showing the correct data to me. What is the output of `lapply(AllData, length)` for you? In my case each element of AllData has 11 columns. – rj-nirbhay May 10 '20 at 09:27
  • Yes, it gives 11 in each one for me as well, but if I type, for example, head(screener, 25) it shows only rows 21 to 40 (the second url). – CarlosFC May 10 '20 at 09:55
  • Understood your concern. I'll update the answer in a few minutes. The issue is the reference link. – rj-nirbhay May 10 '20 at 10:22
  • @CarlosFC Refer to the updated script; it should give you the desired results. – rj-nirbhay May 10 '20 at 10:29
  • Thanks a lot. It works very well now. Just two questions: is there another way to avoid finding the table number when working with html_nodes? And how can I convert AllData into a single table? – CarlosFC May 10 '20 at 11:37
  • @CarlosFC you can use `AllDataCombined – rj-nirbhay May 10 '20 at 11:48
  • @CarlosFC Glad to help. As my suggestion suitably answers your question, click the tick mark to accept it as the chosen answer and upvote it. – rj-nirbhay May 10 '20 at 11:51
  • After working more with the code you gave me (it works very well), I realized that the TableList is fine when I don't need many records, but when I need to scrape, for example, 3751 records (https://finviz.com/screener.ashx?v=111&f=sec_financial) I cannot type all the page numbers because it would be endless. I was wondering whether there is another way to avoid typing all the page numbers. Each page number is the previous one plus 20. – CarlosFC May 18 '20 at 17:03
  • You can look at the RSelenium package for better handling of such cases – rj-nirbhay May 19 '20 at 03:34
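On the follow-up about 3,751 results: since the &r= offsets advance in steps of 20, they can be generated with seq() instead of typed by hand; a minimal sketch (the 3751 total is the figure quoted in the comment):

```r
# The &r= offsets advance by 20, so generate them instead of typing them.
total     <- 3751                     # total results reported by the screener
offsets   <- seq(1, total, by = 20)   # 1, 21, 41, ..., 3741
TableList <- as.character(offsets)    # drop-in replacement for the hand-typed vector
length(TableList)                     # number of pages to fetch
```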