rvest not capturing the entire table

Time:12-14

Hi, I would like to scrape a single table containing 100 rows, but with rvest I only seem to get the first 20 rows. Interestingly, it captures the Name column for the entire table, but after row 20 the rest of the columns are NA or blank.

library(rvest)
library(xml2)

# Read the static HTML and collect every <table> node on the page
html <- read_html("https://coinmarketcap.com/historical/20150621/")
tables <- html_nodes(html, "table")

# The third table holds the listing; keep its first 10 columns
df <- as.data.frame(html_table(tables[[3]], fill = TRUE))
df <- df[, 1:10]
df[1:25, ]

This is what the table looks like:

> df[1:25,  ]
   Rank             Name Symbol        Market Cap       Price   Circulating Supply   Volume (24h)   % 1h  % 24h   % 7d
1     1       BTCBitcoin    BTC $3,488,111,052.52     $243.94       14,298,800 BTC $10,600,886.00 -0.09% -0.39%  4.33%
2     2           XRPXRP    XRP   $329,106,281.79    $0.01031 31,908,551,587 XRP *    $564,946.56  0.68% -6.49% 26.52%
3     3      LTCLitecoin    LTC   $121,255,276.52       $3.02       40,119,404 LTC  $3,196,087.25  0.66% -0.02% 50.72%
4     4     DOGEDogecoin   DOGE    $20,882,626.13  $0.0002091  99,890,370,337 DOGE    $345,750.50  0.33% -0.46% 25.29%
5     5     BTSBitShares    BTS    $19,410,447.59   $0.007727  2,511,953,117 BTS *     $66,206.36 -1.53% -3.65% 12.20%
6     6       XLMStellar    XLM    $17,058,468.94   $0.003526  4,837,354,256 XLM *     $25,278.98 -2.85% -4.09%  8.34%
7     7         DASHDash   DASH    $15,581,959.93       $2.84       5,482,231 DASH     $42,407.43 -0.17% -1.17%  1.37%
8     8           NXTNxt    NXT    $13,625,080.25    $0.01363    999,997,096 NXT *     $32,074.26  0.99% -3.74% 15.89%
9     9         BANXBanx   BANX     $9,648,845.01       $1.64     5,894,665 BANX *     $15,804.05 -0.11% -0.41%  4.33%
10   10      PPCPeercoin    PPC     $8,857,457.26     $0.3949       22,428,765 PPC     $63,627.21 -0.46% -5.40% 21.14%
11   11 MAIDMaidSafeCoin   MAID     $8,112,629.90    $0.01793   452,552,412 MAID *     $11,125.53 -0.65% -0.56%  7.06%
12   12      NMCNamecoin    NMC     $5,681,492.39     $0.4815       11,800,400 NMC     $16,962.83 -0.99% -4.69% 43.39%
13   13      BCNBytecoin    BCN     $5,086,827.18 $0.00002924  173,955,598,772 BCN      $5,500.92  0.93%  2.81%  2.53%
14   14        XMRMonero    XMR     $4,286,720.12     $0.5233        8,192,114 XMR     $20,025.62 -1.03% -2.23%  5.73%
15   15     BLKBlackCoin    BLK     $3,932,944.75    $0.05248     74,938,648 BLK *    $212,834.00  1.26% -3.55% 42.16%
16   16  XCPCounterparty    XCP     $3,358,114.93       $1.27      2,640,365 XCP *      $2,235.02 -0.09%  3.81% -6.94%
17   17      VTCVertcoin    VTC     $3,264,822.95     $0.2048       15,941,100 VTC     $35,518.47 -2.41%  2.72% 32.79%
18   18        YBCYbCoin    YBC     $3,161,465.76       $1.05      3,000,000 YBC *     $54,359.75  0.11%  2.52% 15.12%
19   19     MONAMonaCoin   MONA     $2,993,610.25     $0.1452      20,619,400 MONA      $8,199.22 -0.88%  3.92% -8.74%
20   20    UNITYSuperNET  UNITY     $2,675,341.46       $3.28      816,061 UNITY *        $644.62  2.47% -3.88% 16.08%
21   NA      BitcoinDark                                                                                              
22   NA         NuShares                                                                                              
23   NA        Primecoin                                                                                              
24   NA     Infinitecoin                                                                                              
25   NA        Startcoin   

Does anyone know what is going on?

CodePudding user response:

The issue here is that the page uses JavaScript to add rows to the table as you scroll down, so the data for all rows is not present in the HTML that read_html retrieves.
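One way to check this from R is to count how many cells in each row actually contain text. The snippet below is only a sketch: the inline HTML is a hypothetical miniature modeled on the output in the question (two complete rows and one placeholder row), not the real page source.

```r
library(rvest)

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs
page <- read_html('<table>
  <tr><th>Rank</th><th>Name</th><th>Price</th></tr>
  <tr><td>1</td><td>Bitcoin</td><td>$243.94</td></tr>
  <tr><td>2</td><td>XRP</td><td>$0.01031</td></tr>
  <tr><td></td><td>BitcoinDark</td><td></td></tr>
</table>')

rows <- html_nodes(page, "table tr")

# For each row, count the <td> cells that contain non-empty text
cells_filled <- vapply(rows, function(row) {
  cell_text <- html_text(html_nodes(row, "td"), trim = TRUE)
  sum(nzchar(cell_text))
}, integer(1))

cells_filled
```

Rows whose count drops sharply (here the last row, with only the name filled in) are placeholders that the browser would normally complete later. Running the same count on tables[[3]] from the question should show where the static data ends.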

The first 200 rows of data are embedded in the page source as JSON, inside this tag:

<script id="__NEXT_DATA__" type="application/json">
  ...json here...
</script>

You could retrieve a data frame from there like this:

library(rvest)
library(jsonlite)

json_data <- read_html("https://coinmarketcap.com/historical/20150621/") %>%
  html_node("#__NEXT_DATA__") %>% 
  html_text() %>% 
  fromJSON()

df_data <- json_data$props$initialState$cryptocurrency$listingHistorical$data

dim(df_data)
[1] 200  16

Note that this data frame has nested columns (columns that are themselves data frames or lists), which you'll have to flatten or unnest before use.
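As a rough illustration of dealing with nested columns, jsonlite::flatten() expands data-frame columns into ordinary columns. The JSON below is a hypothetical example with made-up field names; the real structure of df_data may differ, so treat this as a sketch rather than a drop-in fix.

```r
library(jsonlite)

# Hypothetical JSON with a nested "quote" object; field names are illustrative only
json <- '[
  {"name": "Bitcoin", "symbol": "BTC",
   "quote": {"USD": {"price": 243.94, "volume_24h": 10600886}}},
  {"name": "XRP", "symbol": "XRP",
   "quote": {"USD": {"price": 0.01031, "volume_24h": 564946.56}}}
]'

# fromJSON() turns the nested objects into a data-frame column inside the data frame
nested <- fromJSON(json)
str(nested, max.level = 1)

# flatten() recursively expands nested data-frame columns into top-level columns
flat <- jsonlite::flatten(nested)
names(flat)
```

For the scraped data, the equivalent call would be jsonlite::flatten(df_data). Columns that hold lists (for example arrays of varying length) are left as they are by flatten() and would need separate handling, such as tidyr::unnest().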

Otherwise, you'll need a browser-automation tool such as RSelenium to scrape the dynamically rendered content.
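A minimal sketch of the RSelenium route might look like the following. It assumes Firefox and a matching driver are installed locally. The function name scrape_rendered_tables and the scroll count and pause are hypothetical choices, not values taken from the site, and whether scrolling alone loads every row has not been verified here.

```r
library(RSelenium)
library(rvest)

# Hypothetical helper: render the page in a real browser, scroll, then parse the result
scrape_rendered_tables <- function(url, n_scrolls = 20, pause = 1) {
  # Start a Selenium server plus browser session (assumes Firefox is available)
  driver <- rsDriver(browser = "firefox", verbose = FALSE)
  remote <- driver$client

  # Always shut down the browser and server, even if an error occurs
  on.exit({
    remote$close()
    driver$server$stop()
  }, add = TRUE)

  remote$navigate(url)

  # Scroll one viewport at a time so the page's JavaScript can fill in rows
  for (i in seq_len(n_scrolls)) {
    remote$executeScript("window.scrollBy(0, window.innerHeight);")
    Sys.sleep(pause)
  }

  # Parse the rendered DOM with rvest, exactly as with a static page
  page <- read_html(remote$getPageSource()[[1]])
  html_table(html_nodes(page, "table"), fill = TRUE)
}
```

Calling scrape_rendered_tables("https://coinmarketcap.com/historical/20150621/") would return a list of data frames, one per table. If rows are still missing, increasing n_scrolls or pause is the first thing to try.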
