Scraping tables from finviz in R

I would like to extract the quarterly tables from www.finviz.com: income statement, balance sheet and cash flow. I am interested in multiple stocks, but to automate this I first need to know how to scrape a single stock. An example is the following:

https://finviz.com/quote.ashx?t=A&ty=c&p=d&b=1

These tables appear at the bottom of the page. Other tables on the same page can be scraped with rvest, but these are a different case and I have not been able to scrape them.

I would appreciate it if someone could help me with this issue.

CodePudding user response:

Here's an example using rvest and RSelenium. Note that the rsDriver part can be somewhat tricky: a compatible browser (Firefox for me) needs to be installed, and the chosen port must not already be in use.

Code

library(rvest)
library(tidyverse)
library(RSelenium)

url <- "https://finviz.com/quote.ashx?t=A&ty=c&p=d&b=1"

# setup Selenium
# (this will start a firefox window; may be necessary to
# accept privacy preferences or similar manually!)
rD <- rsDriver(port = 4450L, browser = "firefox")
remDr <- rD$client

# navigate to the url
remDr$navigate(url)

# define selector for buttons in cookie preferences overlay
b_read_selector <- ".lcqSKB"
# find button and click (twice)
for (i in 1:2) {
  b_read <- remDr$findElements("css selector", b_read_selector)
  # only click if the overlay button is actually present
  if (length(b_read) > 0) b_read[[1]]$clickElement()
  Sys.sleep(1)
}

# define selectors for the quarterly link and the table
q_link_selector <- "#statements > table.fullview-links > tbody > tr > td:nth-child(2) > a:nth-child(2)"
tab <- "#statements > .snapshot-table2"

# find the 'quarterly' link and click
q_link <- remDr$findElements("css selector", q_link_selector)[[1]]
q_link$clickElement()
# give the quarterly table a moment to load
Sys.sleep(1)

# get the page source and parse html
pg <- remDr$getPageSource()[[1]] %>% read_html()

# then parse the table to tibble
pg %>% html_node(tab) %>% html_table(header = TRUE) %>% select(-2)

# close the firefox session and stop the Selenium server
remDr$close()
rD$server$stop()

Note that I drop the second column (the one with the bar plots) with select(-2): images are not scraped, so the column is empty anyway.
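The fixed Sys.sleep() pauses above worked for me, but on a slow connection an element may show up later than one second. As a minimal sketch, a polling helper could replace them. It assumes a client like remDr from above; wait_for_element, timeout and interval are names I made up for illustration, they are not part of RSelenium.

```r
# Poll until at least one element matches the CSS selector, or give up.
# `driver` is assumed to be an RSelenium client such as remDr above.
wait_for_element <- function(driver, selector, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElements() returns an empty list when nothing matches
    found <- driver$findElements("css selector", selector)
    if (length(found) > 0) return(found[[1]])
    if (Sys.time() >= deadline) {
      stop("No element matched '", selector, "' within ", timeout, " seconds")
    }
    Sys.sleep(interval)
  }
}

# Usage (needs the live session from above):
# q_link <- wait_for_element(remDr, q_link_selector)
# q_link$clickElement()
```

The helper returns the first matching element, so it can stand in for the findElements(...)[[1]] calls.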

Result

# A tibble: 20 × 9
   `Period End Date` `7/31/2021` `4/30/2021` `1/31/2021` `10/31/2020`  `7/31/2020`  `4/30/2020`  `1/31/2020` 
   <chr>             <chr>       <chr>       <chr>       <chr>         <chr>        <chr>        <chr>       
 1 Period Length     "3 Months"  "3 Months"  "3 Months"  3 Months      3 Months     3 Months     3 Months    
 2 Total Revenue     "1,586.00"  "1,525.00"  "1,548.00"  Upgrade your… Upgrade you… Upgrade you… Upgrade you…
 3 Cost of Revenue   "734.00"    "708.00"    "710.00"    Upgrade your… Upgrade you… Upgrade you… Upgrade you…
 4 Gross Profit      "852.00"    "817.00"    "838.00"    Upgrade your… Upgrade you… Upgrade you… Upgrade you…
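
All value columns come back as character because of the thousands separators and the "Upgrade your…" placeholders. Here is a small sketch for turning them into numbers with base R only. to_number is a hypothetical helper name; anything that is not a number (such as the placeholder text) becomes NA.

```r
# Convert strings like "1,586.00" to numeric; non-numeric text becomes NA
to_number <- function(x) {
  cleaned <- gsub(",", "", x, fixed = TRUE)  # drop thousands separators
  suppressWarnings(as.numeric(cleaned))      # silence the NA coercion warning
}

to_number(c("1,586.00", "734.00", "Upgrade your…"))
# returns 1586, 734 and NA
```

Applied to the whole table, rows such as "Period Length" ("3 Months") would also turn into NA, so it is probably best to filter those out first.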

To get the balance sheet or cash flow data, click the respective link before parsing the page source. This works the same way as for the quarterly link:

balance_link_selector <- "#statements > table.fullview-links > tbody > tr > td:nth-child(1) > a:nth-child(2)"
cash_link_selector <- "#statements > table.fullview-links > tbody > tr > td:nth-child(1) > a:nth-child(3)"

# e.g. to get balance sheet:
balance_link <- remDr$findElements("css selector", balance_link_selector)[[1]]

# click balance sheet link
balance_link$clickElement()

... then proceed as above.
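
Since the statement selectors differ only in the position of the link, they can also be generated. A sketch, with one assumption: I only gave positions 2 and 3 above, so treating the income statement as position 1 is a guess that you should check in the browser. statement_selector is a hypothetical helper name.

```r
# Build the CSS selector for a statement link.
# Positions 2 (balance) and 3 (cash) are from the selectors above;
# position 1 for the income statement is an assumption.
statement_selector <- function(type = c("income", "balance", "cash")) {
  type <- match.arg(type)
  position <- c(income = 1L, balance = 2L, cash = 3L)[[type]]
  sprintf(
    "#statements > table.fullview-links > tbody > tr > td:nth-child(1) > a:nth-child(%d)",
    position
  )
}

statement_selector("cash")
```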

This should put you in a position to write functions or loops that automate the process. Note that the code closing the cookie preferences overlay only needs to run once, after the first call to remDr$navigate(url): cookies are set for the whole session, so you can navigate to and scrape multiple urls afterwards without rerunning the loop.
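
As a rough sketch of what such a loop could look like: it assumes the running session (remDr) from above with the cookie overlay already dismissed, and the q_link_selector and tab selectors as defined earlier. finviz_url and scrape_quarterly are hypothetical names, and the one-second pause is a guess that may need tuning.

```r
# Build the quote URL for one ticker (same query parameters as the example URL)
finviz_url <- function(ticker) {
  paste0("https://finviz.com/quote.ashx?t=", ticker, "&ty=c&p=d&b=1")
}

# Navigate to one ticker, switch to the quarterly view and parse the table
scrape_quarterly <- function(driver, ticker, link_selector, table_selector) {
  driver$navigate(finviz_url(ticker))
  links <- driver$findElements("css selector", link_selector)
  if (length(links) == 0) stop("Quarterly link not found for ", ticker)
  links[[1]]$clickElement()
  Sys.sleep(1)  # assumed to be long enough for the table to refresh
  page <- rvest::read_html(driver$getPageSource()[[1]])
  tbl <- rvest::html_table(rvest::html_node(page, table_selector), header = TRUE)
  tbl[, -2]  # drop the empty bar-plot column
}

# Usage (needs the live session from above):
# tickers <- c("A", "AAPL", "MSFT")
# results <- lapply(tickers, function(tk) {
#   scrape_quarterly(remDr, tk, q_link_selector, tab)
# })
# names(results) <- tickers
```

Keeping the URL builder separate makes it easy to check the generated addresses before starting a browser session.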