I've scraped some lyrics from the link https://www.vagalume.com.br/ivete-sangalo/. There, the lyrics are displayed as follows (just a snippet):
Quando a chuva passar
Pra quê falar
Se você não quer me ouvir?
Fugir agora não resolve nada
As you can see, each line of the lyrics appears on its own line. But when I scrape the lyrics and save them in a CSV file, R returns the lines merged together, as follows:
Output:
Quando a chuva passarPra quê falarSe você não quer me ouvir?Fugir agora não resolve nada
This is my code:
library(rvest)
library(dplyr)
link <- "https://www.vagalume.com.br/ivete-sangalo/"
page <- read_html(link)
name_link_id <- page %>% html_nodes('.nameMusic') %>% html_attr("href")
name_link_full <- page %>% html_nodes('.nameMusic') %>% html_attr("href") %>%
  paste("https://www.vagalume.com.br", ., sep = "")
get_lyrics <- function(lyrics_link){
  lyric <- read_html(lyrics_link)
  all_lyrics <- lyric %>% html_nodes('#lyrics') %>% html_text()
  return(all_lyrics)
}
lyr <- sapply(name_link_full, FUN = get_lyrics)
lyrs <- data.frame(lyr, stringsAsFactors = FALSE)
write.csv(lyrs, 'Ivete.Sangalo.csv')
I've tried stringi() and strsplit(), but nothing changes. Please, how can I fix this?
CodePudding user response:
The following function reads in the data and returns a data.frame with one column named lyrics.
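library(rvest)
library(dplyr)
library(stringr)  # provides str_split()
get_lyrics <- function(lyrics_link){
  lyrics_link %>%
    read_html() %>%
    html_nodes('#lyrics') %>%
    html_text2() %>%                # unlike html_text(), turns <br> into "\n"
    gsub("\\n\\n", "\n", .) %>%     # collapse doubled line breaks
    str_split(pattern = "\\n") %>%  # one element per lyric line
    unlist() %>%
    as.data.frame() %>%
    `names<-`("lyrics")
}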
link <- "https://www.vagalume.com.br/ivete-sangalo/"
page <- read_html(link)
name_link_full <- page %>%
  html_nodes('.nameMusic') %>%
  html_attr("href") %>%
  paste("https://www.vagalume.com.br", ., sep = "")
lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
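The difference between html_text() and html_text2() that the function above relies on can be checked offline. This is a minimal sketch, assuming rvest 1.0.0 or later. The HTML fragment is a made-up stand-in for the real '#lyrics' node, whose actual markup may differ.

```r
library(rvest)

# Hypothetical fragment: lyric lines separated by <br>, as an assumption
# about how the '#lyrics' node is structured on the real page.
fragment <- paste0(
  '<div id="lyrics">Quando a chuva passar<br>',
  'Pra quê falar<br>',
  'Se você não quer me ouvir?</div>'
)
node <- html_element(read_html(fragment), "#lyrics")

merged <- html_text(node)   # <br> is dropped, so the lines run together
broken <- html_text2(node)  # <br> becomes "\n"

# Split on the newline to get one element per lyric line
lyric_lines <- strsplit(broken, "\n", fixed = TRUE)[[1]]
print(merged)
print(lyric_lines)
```

Since html_text() never emits the line breaks in the first place, no amount of strsplit() on its output can recover them, which would explain why splitting had no effect in the question.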
Edit
Following the comments below, here are two ways of writing the lyrics to file.
First, collapse each element of the list lyr into a single string, rbind the results into a data frame, and remove the column header.
lyr <- lapply(name_link_full[1:5], FUN = get_lyrics)
lyrs <- lapply(lyr, \(l) paste(unlist(l), collapse = " "))
lyrs <- do.call(rbind.data.frame, lyrs)
names(lyrs) <- ''
Then write the result as CSV and as plain text. Changing to the directory "~/tmp" is optional.
old_dir <- getwd()
setwd("~/tmp")
write.csv(lyrs, 'Ivete.Sangalo.csv', quote = FALSE, row.names = FALSE)
writeLines(unlist(lyrs), con = 'Ivete.Sangalo.txt')
setwd(old_dir)
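As an alternative to collapsing each song into one string, here is a rough sketch that keeps one row per lyric line and avoids setwd(). The list lyr below is placeholder data standing in for the result of lapply(name_link_full, get_lyrics). The song and line columns, the output file name, and the use of tempdir() are assumptions for illustration only. Leaving write.csv() at its default quoting keeps a lyric line that contains a comma in a single field.

```r
# Placeholder data (not real lyrics): one data.frame per song, as get_lyrics() returns
lyr <- list(
  data.frame(lyrics = c("line one, with a comma", "line two"),
             stringsAsFactors = FALSE),
  data.frame(lyrics = c("another first line", "another second line"),
             stringsAsFactors = FALSE)
)

# Stack the songs into one long data.frame: one row per lyric line,
# with a song index and a line index so songs can be regrouped later.
long <- do.call(rbind, lapply(seq_along(lyr), function(i) {
  data.frame(song   = i,
             line   = seq_len(nrow(lyr[[i]])),
             lyrics = lyr[[i]]$lyrics,
             stringsAsFactors = FALSE)
}))

# Build the full path instead of changing the working directory
out <- file.path(tempdir(), "Ivete.Sangalo.long.csv")
write.csv(long, out, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm the rows survive the round trip
back <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
print(back)
```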