Webscraping with CSS Selector results in more data than required in nodes


I'm trying to scrape https://nomics.com/ for asset and exchange data. I want to get the rank, name, price, etc. for every page (100 rows each) across i pages. I've successfully done the same for all of the exchanges listed there. I'm using the CSS selector tool in the Chrome (and Brave) browser to obtain the node IDs.

Exchanges MWE

# libraries & dependencies
library(rvest)
library(dplyr)

# website url
base_url <- "https://nomics.com/exchanges/" 

# empty list to store page results
datalist <- list()

for(i in 1:6){
  # build the url for page i, e.g. https://nomics.com/exchanges/1
  new_url <- paste0(base_url, i)
  
  page <- read_html(new_url)
  
  # one CSS selector per column of the exchanges table
  rank           <- page %>% html_nodes(".n-pv12.f6-ns") %>% html_text()
  name           <- page %>% html_nodes(".fw5.nowrap.truncate") %>% html_text()
  impact_score   <- page %>% html_nodes(".n-ph6.f6") %>% html_text()
  volume         <- page %>% html_nodes(".f6-ns .mono") %>% html_text()
  volume_percent <- page %>% html_nodes(".mono.n-dtc-1120") %>% html_text()
  rating         <- page %>% html_nodes(".n-dtc-1120  td") %>% html_text()
  trades         <- page %>% html_nodes(".n-pv18.n-dtc-650") %>% html_text()
  trades_percent <- page %>% html_nodes(".mono.n-dtc-1280") %>% html_text()
  pairs          <- page %>% html_nodes(".n-pv18.n-dtc-768") %>% html_text()
  fiat           <- page %>% html_nodes(".mono.n-dtc-1024") %>% html_text()
  
  datalist[[i]] <- data.frame(rank, name, impact_score, volume, volume_percent, rating, trades, trades_percent, pairs, fiat)
}

# combine results and store as tibble
big_data <- do.call(rbind, datalist)
tibble(big_data)

When I run this, I get a nice tibble with everything I could wish for.

Cryptocurrency Assets MWE

Now, when I try to do the same for the asset data on the homepage itself, I can't seem to select the correct nodes with the CSS selector tool. I've tried different CSS selector tools, used the developer tools in Chrome, and tried different browsers. It seems that more data is included in the nodes I select than I need, which makes them hard to wrangle.

For the rank variable, the output is messed up: sometimes another number is appended after the actual rank itself. For the other variables, I could simply filter out the extra rows.

# base url
base_url <- "https://nomics.com/" 
page <- read_html(base_url)

### extract html nodes ###

# extract rank, output is messed up on the numbering
rank <- page %>% html_nodes(".flex-column.n-pl6") %>% html_text() %>% tibble(rank = .)

# extract names
name <- page %>% html_nodes(".overflow-visible") %>% html_text() %>% tibble(name = .)
name <- name[201:300,] # output is messy, keep only the rows we need

# extract price
price <- page %>% html_nodes(".n-dark-gray.f7-s.fw5") %>% html_text() %>% tibble(price = .)
price <- price[1:100,] # output is messy, keep only the rows we need

df <- rank %>% bind_cols(name, price)

It seems that only the rank variable is messed up, and I'm at a loss as to how to correct it. I guess I could use regex or subsetting to fix this, but I'm hoping there is another way to approach it.
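
One quick way to see where the over-matching happens is to count how many elements each selector returns; on a 100-row page, anything other than 100 means that selector is picking up extra nodes. A minimal diagnostic sketch, reusing the selectors from above (the exact counts will depend on the live page):

library(rvest)

page <- read_html("https://nomics.com/")

# one named selector per variable, copied from the attempts above
selectors <- c(rank  = ".flex-column.n-pl6",
               name  = ".overflow-visible",
               price = ".n-dark-gray.f7-s.fw5")

# number of nodes each selector matches; 100 is the expected row count
sapply(selectors, function(s) length(html_nodes(page, s)))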

Tibbles were just easier for me to wrangle. When I try to loop this just as I did for the exchanges, it obviously messes up the results.

Does anyone have any idea why the results are messed up on the homepage, while on /exchanges it's a breeze?

Update 1

Trying to loop according to the first answer below. Without the loop, it works fine for each separate page. It runs, but it won't subset or store/append the different pages in the list for the rbind.

# base url
base_url <- "https://nomics.com/" 

# create empty list to store pages in
datalist <- list()

# create loop for i pages
for(i in 1:5){
  
  # build the url for page i
  new_url <- paste0(base_url, i)
  
  message("Retrieving page ", i)
  
  # pull the embedded JSON payload for this page and parse it
  data <- read_html(new_url) %>% 
    html_element('#__NEXT_DATA__') %>% 
    html_text() %>% 
    jsonlite::parse_json(simplifyVector = T)
  
  # keep the first 25 columns of the ticker data for this page
  datalist[[i]] <- data$props$pageProps$data$currenciesTicker[,1:25]

}

big_data <- do.call(rbind, datalist)

tibble(big_data)

View(big_data)

CodePudding user response:

That data is stored in a script tag (the #__NEXT_DATA__ element that Next.js embeds in the page). You can extract the JSON from that tag and have a dataframe right off the bat.

library(rvest)
library(jsonlite)
library(tidyverse)

data <- read_html('https://nomics.com/') %>% 
  html_element('#__NEXT_DATA__') %>% 
  html_text() %>% 
  jsonlite::parse_json(simplifyVector = T)

listings <- data$props$pageProps$data$currenciesTicker
head(listings) %>% select(id, rank)

You can turn the above into a function and use map_dfr to generate one big dataframe from the desired input URLs. Then subset for whichever columns you want, as shown after the code below.

library(rvest)
library(jsonlite)
library(tidyverse)

get_df <- function(url) {
  data <- read_html(url) %>%
    html_element("#__NEXT_DATA__") %>%
    html_text() %>%
    jsonlite::parse_json(simplifyVector = T)
  listings <- data$props$pageProps$data$currenciesTicker
  return(listings)
}

urls <- paste0("https://nomics.com/", 1:5)

big_df <- map_dfr(urls, get_df)
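
For example, to keep just a few columns of interest (id and rank appear in the payload as shown above; price is an assumption about what the ticker data contains):

# subset the combined dataframe to selected columns;
# `price` is assumed to be one of the columns in the ticker payload
big_df %>% 
  select(id, rank, price) %>% 
  as_tibble()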