I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem that I ran into when using the str_replace_all() function. Here is the code:
library(tidyverse)
library(rvest)
pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_álbum_de_rap") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # convert to a table
  html_table()
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
The problem is that after I remove the elements between brackets (hyperlinks in Wikipedia) from the Artista variable and tabulate the counts by artist, Eminem appears three times as if it were three different artists, and Kanye West appears twice. I appreciate any solutions in advance.
CodePudding user response:
There are some hidden characters still attached to the strings, and trimws() does not remove them. You can use nchar(sort(rap$Artista)) to see the number of characters in each entry.
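To make the "hidden bits" visible, one option is to look at the code points as well as the lengths. The sketch below uses made-up strings: the zero-width space (U+200B) and non-breaking space (U+00A0) are only assumed examples of invisible characters, not a confirmed diagnosis of what the Wikipedia table contains.

```r
# Hypothetical values: the second and third entries end in an invisible
# character (U+200B and U+00A0), chosen only for illustration.
artists <- c("Eminem", "Eminem\u200b", "Eminem\u00a0")

# The entries look alike when printed but differ in length.
lengths_by_entry <- nchar(artists)

# Code point of the last character of each entry.
last_code_point <- vapply(
  artists,
  function(x) {
    cp <- utf8ToInt(x)
    cp[length(cp)]
  },
  integer(1),
  USE.NAMES = FALSE
)

# trimws() with its default whitespace class leaves these characters in place.
trimmed_lengths <- nchar(trimws(artists))

print(data.frame(lengths_by_entry, last_code_point, trimmed_lengths))
```

Running the same idea on rap$Artista (for example utf8ToInt() on one of the duplicated Eminem entries) should show which character is actually causing the duplicates.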
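Here is a messy regular expression that keeps the leading run of letters, spaces, commas and & and drops everything after it. Note that inside the brackets " -," is read as a range from space to comma, so the hyphen itself is not matched; that is why Jay-Z shows up as Jay in the output below.
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)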
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
Cardi B Chance the Rapper Drake Eminem Jay Kanye West Kendrick Lamar
      1                 1     1      6   1          4              2
Lil Wayne Ludacris Macklemore & Ryan Lewis Nas Naughty by Nature Outkast Puff Daddy
        1        1                       1   1                 1       2          1
The Fugees Tyler, the Creator
         1                  2
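The table lists Jay rather than Jay-Z. A likely reason is that a hyphen placed between two characters inside a bracket expression defines a range, so " -," covers space through comma and leaves out the hyphen. A minimal sketch on made-up entries (the sample strings and the trailing "\u200b" are assumptions for illustration), comparing that class with one where the hyphen is placed last so it is taken literally:

```r
# Hypothetical sample entries; "\u200b" stands in for an invisible trailing character.
artists <- c("Jay-Z[5]\u200b", "Tyler, the Creator[9]", "Macklemore & Ryan Lewis")

# " -," inside the brackets is a range (space through comma), so "-" is not matched.
with_range <- gsub("([a-zA-Z -,&]+).*", "\\1", artists)

# Placing "-" last in the class makes it a literal hyphen.
with_literal <- gsub("([a-zA-Z ,&-]+).*", "\\1", artists)

print(with_range)
print(with_literal)
```

With the hyphen at the end of the class, hyphenated names are kept whole, while the bracketed reference and anything after it are still dropped.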
Here is another regular expression that seems a bit clearer:
gsub("[^a-zA-Z]*$", "", rap$Artista)
Anchored at the end of the string, it removes the trailing run of zero or more characters that are not a to z or A to Z. Assign the result back to rap$Artista to keep it.
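As a quick check of that pattern, here is a small sketch on made-up entries (the reference numbers and the "\u200b" character are assumptions, not values taken from the page) showing that the variants collapse into one name before tabulating:

```r
# Hypothetical entries that print alike but differ in their trailing characters.
artists <- c("Eminem[7]\u200b", "Eminem[12]", "Eminem", "Jay-Z[5]\u200b")

# Strip everything after the last ASCII letter.
cleaned <- gsub("[^a-zA-Z]*$", "", artists)

# Tabulate the cleaned names.
counts <- table(cleaned)
print(counts)
```

One limitation to keep in mind: because the class only covers ASCII letters, a name that ends in a digit or an accented letter would lose that final character, so the pattern may need widening for other tables.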