Web-scraping table with merged row entries in R

Time:10-04

I'm trying to scrape data tables from a website:

https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/

The issue is that each entry in the first column is merged across multiple rows, while the second column has discrete entries.

In the scraped table, the separate entries in the second column have been merged into a single entry joined by \n. I want to split the merged data into separate rows and need some help with that.

The code for web scraping is:

library(rvest)
# Spotify's list of new artists to look out for
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
# html_table() returns a list with one data frame per table on the page
upcoming_artists <- html_table(upcoming_artists)

The erroneous data frame looks something like:

list(structure(list(X1 = c("United States", "United Kingdom", 
"Brazil", "Mexico", "Argentina", "Colombia", "Panama", "Spain", 
"Australia", "France", "UAE & Lebanon", "South Africa", "Philippines", 
"Indonesia", "Taiwan", "Austria", "Germany", "Netherlands", "Japan\n*RADAR locally titled Early Noise", 
"India"), X2 = c("Alaina Castillo", "Young T   Bugsey", "Agnes Nunes", 
"Silvana Estrada", "Romeo El Santo", "Ela Minus", "Boza", "DORA \nAleesha\nMaría José Llergo\nGuitarricadelafuente\nParanoid 1966", 
"merci, mercy", "Lous and the Yakuza \nYuzmv\nPhilippine\nHervé", 
"Hollaphonic x Xriss", "Elaine", "SB19\nAugust Wahh", "Mahen\nMonica Karina", 
"張若凡RuoFan", "AVEC \nMy Ugly Clementine", "badmómzjay", 
"RIMON \nJeangu Macrooy", "Fujii Kaze\nVaundy\nRina Sawayama", 
"Mali\nWhen Chai Met Toast\nTaba Chake")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame")))

Answer:

Use separate_rows() from the tidyr package to split a column's delimited values into separate rows.
I have also changed the scraping code a bit.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

#Spotify's list of new artists to look out for
upcoming_artists_link <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"

upcoming_artists <- read_html(upcoming_artists_link)
upcoming_artists %>%
  html_elements("tbody") %>%
  html_table() %>%
  `[[`(1) %>%
  tidyr::separate_rows(X2, sep = "\n")
#> # A tibble: 35 × 2
#>    X1             X2               
#>    <chr>          <chr>            
#>  1 United States  Alaina Castillo  
#>  2 United Kingdom Young T   Bugsey 
#>  3 Brazil         Agnes Nunes      
#>  4 Mexico         Silvana Estrada  
#>  5 Argentina      Romeo El Santo   
#>  6 Colombia       Ela Minus        
#>  7 Panama         Boza             
#>  8 Spain          DORA             
#>  9 Spain          Aleesha          
#> 10 Spain          María José Llergo
#> # … with 25 more rows

Created on 2022-10-03 with reprex v2.0.2
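
The same idea can be tried without network access. The following is only a minimal sketch, assuming a three-row subset copied from the dput() output in the question; the names artists and artists_long are made up for illustration. It differs from the answer above in two ways that are my own assumptions about what is wanted: the separator is a regular expression that also swallows spaces around the line break (the scraped values contain entries such as "AVEC " with a trailing space), and the footnote that follows "Japan" in X1 is dropped.

```r
library(tidyr)

# Three rows copied from the dput() output in the question (illustrative subset)
artists <- data.frame(
  X1 = c("United States", "Austria", "Japan\n*RADAR locally titled Early Noise"),
  X2 = c("Alaina Castillo",
         "AVEC \nMy Ugly Clementine",
         "Fujii Kaze\nVaundy\nRina Sawayama"),
  stringsAsFactors = FALSE
)

# Split X2 on line breaks; the regex also removes spaces around each break,
# so "AVEC " becomes "AVEC"
artists_long <- separate_rows(artists, X2, sep = "\\s*\\n\\s*")

# Assumption: the footnote after the country name is not wanted,
# so keep only the text before the first line break in X1
artists_long$X1 <- sub("\\n.*$", "", artists_long$X1)

print(artists_long)
```

If the footnote in X1 should be kept, the sub() line can simply be left out; separate_rows() only touches the columns that are named in the call.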
