Hi, I would like to scrape a single table containing 100 rows, but with rvest I only get complete data for the first 20 rows. Interestingly, it still captures the coin names for the entire table, but after row 20 every other column is NA.
library(rvest)
library(xml2)
html <- rvest::read_html("https://coinmarketcap.com/historical/20150621/")
tables <- html_nodes(html, "table")
df <- as.data.frame(rvest::html_table(tables[[3]], fill = TRUE))
df <- df[, 1:10]
df[1:25, ]
This is what the table looks like:
> df[1:25, ]
Rank Name Symbol Market Cap Price Circulating Supply Volume (24h) % 1h % 24h % 7d
1 1 BTCBitcoin BTC $3,488,111,052.52 $243.94 14,298,800 BTC $10,600,886.00 -0.09% -0.39% 4.33%
2 2 XRPXRP XRP $329,106,281.79 $0.01031 31,908,551,587 XRP * $564,946.56 0.68% -6.49% 26.52%
3 3 LTCLitecoin LTC $121,255,276.52 $3.02 40,119,404 LTC $3,196,087.25 0.66% -0.02% 50.72%
4 4 DOGEDogecoin DOGE $20,882,626.13 $0.0002091 99,890,370,337 DOGE $345,750.50 0.33% -0.46% 25.29%
5 5 BTSBitShares BTS $19,410,447.59 $0.007727 2,511,953,117 BTS * $66,206.36 -1.53% -3.65% 12.20%
6 6 XLMStellar XLM $17,058,468.94 $0.003526 4,837,354,256 XLM * $25,278.98 -2.85% -4.09% 8.34%
7 7 DASHDash DASH $15,581,959.93 $2.84 5,482,231 DASH $42,407.43 -0.17% -1.17% 1.37%
8 8 NXTNxt NXT $13,625,080.25 $0.01363 999,997,096 NXT * $32,074.26 0.99% -3.74% 15.89%
9 9 BANXBanx BANX $9,648,845.01 $1.64 5,894,665 BANX * $15,804.05 -0.11% -0.41% 4.33%
10 10 PPCPeercoin PPC $8,857,457.26 $0.3949 22,428,765 PPC $63,627.21 -0.46% -5.40% 21.14%
11 11 MAIDMaidSafeCoin MAID $8,112,629.90 $0.01793 452,552,412 MAID * $11,125.53 -0.65% -0.56% 7.06%
12 12 NMCNamecoin NMC $5,681,492.39 $0.4815 11,800,400 NMC $16,962.83 -0.99% -4.69% 43.39%
13 13 BCNBytecoin BCN $5,086,827.18 $0.00002924 173,955,598,772 BCN $5,500.92 0.93% 2.81% 2.53%
14 14 XMRMonero XMR $4,286,720.12 $0.5233 8,192,114 XMR $20,025.62 -1.03% -2.23% 5.73%
15 15 BLKBlackCoin BLK $3,932,944.75 $0.05248 74,938,648 BLK * $212,834.00 1.26% -3.55% 42.16%
16 16 XCPCounterparty XCP $3,358,114.93 $1.27 2,640,365 XCP * $2,235.02 -0.09% 3.81% -6.94%
17 17 VTCVertcoin VTC $3,264,822.95 $0.2048 15,941,100 VTC $35,518.47 -2.41% 2.72% 32.79%
18 18 YBCYbCoin YBC $3,161,465.76 $1.05 3,000,000 YBC * $54,359.75 0.11% 2.52% 15.12%
19 19 MONAMonaCoin MONA $2,993,610.25 $0.1452 20,619,400 MONA $8,199.22 -0.88% 3.92% -8.74%
20 20 UNITYSuperNET UNITY $2,675,341.46 $3.28 816,061 UNITY * $644.62 2.47% -3.88% 16.08%
21 NA BitcoinDark
22 NA NuShares
23 NA Primecoin
24 NA Infinitecoin
25 NA Startcoin
Does anyone know what is going on?
CodePudding user response:
The issue here is that the page uses JavaScript to add rows to the table as you scroll down, so the data for all rows is not present when you read the page with read_html.
The first 200 rows of data are contained in the page source, in JSON format, inside this tag:
<script id="__NEXT_DATA__" type="application/json">
...json here...
</script>
You could retrieve a data frame from there like this:
library(rvest)
library(jsonlite)
json_data <- read_html("https://coinmarketcap.com/historical/20150621/") %>%
  html_node("#__NEXT_DATA__") %>%
  html_text() %>%
  fromJSON()
df_data <- json_data$props$initialState$cryptocurrency$listingHistorical$data
dim(df_data)
[1] 200 16
But that data frame has nested columns that you'll have to deal with.
Otherwise you'll need to look at something like RSelenium for scraping dynamic content.
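As a rough sketch of one way to deal with the nested columns mentioned above, jsonlite::flatten() can spread nested data frame columns out into ordinary columns. The JSON below is made up for illustration; the field names (quote, USD, price, market_cap) are assumptions and may not match what the real page returns.

```r
library(jsonlite)

# Hypothetical JSON that only mimics the shape of a nested listing.
# The field names here are assumptions, not the site's actual schema.
json_text <- '[
  {"rank": 1, "name": "Bitcoin",
   "quote": {"USD": {"price": 243.94, "market_cap": 3488111052.52}}},
  {"rank": 2, "name": "XRP",
   "quote": {"USD": {"price": 0.01031, "market_cap": 329106281.79}}}
]'

# fromJSON() turns the nested objects into data frame columns
# that themselves contain data frames.
nested <- fromJSON(json_text)
str(nested, max.level = 1)

# flatten() recursively expands those nested data frames into
# plain columns of a single data frame.
flat <- flatten(nested)
names(flat)
```

With the real data you would call flatten(df_data) instead. Note that flatten() only handles nested data frames; if some columns are lists (for example, arrays in the JSON), they would still need separate handling.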