Web scraping returns merged sentences in R

Time:09-26

I've scraped some lyrics from the link https://www.vagalume.com.br/ivete-sangalo/. There, the lyrics are displayed as follows (just a snippet):

 Quando a chuva passar

 Pra quê falar 
 Se você não quer me ouvir?
 Fugir agora não resolve nada

As you can see, each line of the lyrics is displayed on its own line. But when I scrape the lyrics and save them in a CSV file, R returns the lines merged together, as follows:

 Output:
 Quando a chuva passarPra quê falarSe você não quer me ouvir?Fugir agora não resolve nada

This is my code:

library(rvest)
library(dplyr)

link <- "https://www.vagalume.com.br/ivete-sangalo/"

page <- read_html(link)

name_link_id <- page %>% html_nodes('.nameMusic') %>% html_attr("href")

name_link_full <- page %>%
  html_nodes('.nameMusic') %>%
  html_attr("href") %>%
  paste("https://www.vagalume.com.br", ., sep = "")

get_lyrics <- function(lyrics_link){
  lyric <- read_html(lyrics_link)

  all_lyrics <- lyric %>% html_nodes('#lyrics') %>% html_text()
  return(all_lyrics)
}

lyr <- sapply(name_link_full, FUN = get_lyrics)

lyrs <- data.frame(lyr, stringsAsFactors = FALSE)

write.csv(lyrs, 'Ivete.Sangalo.csv')

I've tried the stringi package and strsplit(), but nothing changes. How can I fix this?

Answer:

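The following function reads in the data and returns a data.frame with one column named lyrics, one row per line of the song. The key change is html_text2() in place of html_text(): html_text2() converts <br> tags to "\n", so the line breaks survive and the text can be split on them.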

library(rvest)
library(dplyr)
library(stringr)  # provides str_split()

get_lyrics <- function(lyrics_link){
  lyrics_link %>%
    read_html() %>%
    html_nodes('#lyrics') %>% 
    html_text2() %>%
    gsub("\\n\\n", "\n", .) %>%
    str_split(pattern = "\\n") %>%
    unlist() %>%
    as.data.frame() %>%
    `names<-`("lyrics")
}
link <- "https://www.vagalume.com.br/ivete-sangalo/"

page <- read_html(link)

name_link_full <- page %>% 
  html_nodes('.nameMusic') %>% 
  html_attr("href") %>%        
  paste("https://www.vagalume.com.br", ., sep = "")

# only the first five songs, to keep the example quick
lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
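
To see the difference between html_text() and html_text2() without hitting the site, here is a minimal offline sketch. The HTML snippet is hypothetical: it assumes the lyrics container separates lines with <br> tags, which may not match the real page markup exactly.

```r
library(rvest)

# Hypothetical markup, assumed to resemble the '#lyrics' container
page <- minimal_html(
  '<div id="lyrics">Quando a chuva passar<br>Pra quê falar<br>Se você não quer me ouvir?<br>Fugir agora não resolve nada</div>'
)

node <- html_element(page, "#lyrics")

# html_text() drops the <br> tags, so the lines run together
merged <- html_text(node)

# html_text2() turns each <br> into "\n", so the lines can be split apart
separated <- html_text2(node)
lyric_lines <- strsplit(separated, "\n", fixed = TRUE)[[1]]

print(merged)
print(lyric_lines)
```

If the real page wraps the lines differently (for example in separate paragraph elements), html_text2() should still insert line breaks, but the number of consecutive newlines may differ.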

Edit

Following the comments below, here are two ways of writing the lyrics to a file.

First, collapse each song's lines into a single string, rbind the resulting strings into a data.frame with one row per song, and remove the column header.

lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
lyrs <- lapply(lyr, \(l) paste(unlist(l),  collapse = " "))
lyrs <- do.call(rbind.data.frame, lyrs)
names(lyrs) <- ''

Then write the result as CSV and as plain text. Changing to the directory "~/tmp" is optional; without it the files go to the current working directory.

old_dir <- getwd()
setwd("~/tmp")
write.csv(lyrs, 'Ivete.Sangalo.csv', quote = FALSE, row.names = FALSE)
writeLines(unlist(lyrs), con = 'Ivete.Sangalo.txt')
setwd(old_dir)
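
As an alternative to switching directories with setwd(), the same files can be written by building full paths with file.path(). This is only a sketch: the list lyr below is stand-in sample data (the second entry is a made-up placeholder), and out_dir is set to tempdir() as an assumption so the example runs anywhere.

```r
# Stand-in for the list `lyr` returned by lapply(); hypothetical sample data
lyr <- list(
  data.frame(lyrics = c("Quando a chuva passar", "Pra quê falar")),
  data.frame(lyrics = c("placeholder line one", "placeholder line two"))
)

# One string per song, lines joined by a space, as in the answer above
lyrs <- vapply(lyr, function(l) paste(unlist(l), collapse = " "), character(1))

# Assumption: swap tempdir() for a directory of your choice, e.g. "~/tmp"
out_dir  <- tempdir()
csv_path <- file.path(out_dir, "Ivete.Sangalo.csv")
txt_path <- file.path(out_dir, "Ivete.Sangalo.txt")

# Default quoting is kept here so commas inside lyrics do not split columns
write.csv(data.frame(lyrics = lyrs), csv_path, row.names = FALSE)
writeLines(lyrs, con = txt_path)
```

Unlike the version above, this keeps a lyrics header and quoted fields in the CSV; pass quote = FALSE if the unquoted form is preferred and the text contains no commas.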