Problem with web scraping from Wikipedia

Time:09-13

I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem I ran into when using the str_replace_all() function. Here is the code:

library(tidyverse)   
library(rvest)

pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_álbum_de_rap") %>% 
  # list all tables on the page
  html_nodes(css = "table") %>%
  # convert to a table
  html_table()

rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)

The problem appears after I remove the bracketed elements (hyperlinks on Wikipedia) from the Artista variable. When I tabulate the counts by artist, Eminem shows up three times as if it were three different artists, and Kanye West shows up twice. Thanks in advance for any solution.

CodePudding user response:

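Some invisible characters are still attached to the strings, and trimws() does not remove them. Calling nchar(sort(test)), where test is the vector of names (presumably rap$Artista here), shows the number of characters in each entry, so entries that look identical but differ in length stand out.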
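To make the diagnosis concrete, here is a minimal sketch. The vector `artists` is a hypothetical stand-in for rap$Artista, and the two invisible characters in it (a non-breaking space and a zero-width space) are assumptions chosen for illustration; the characters actually present in the scraped table may differ.

```r
# Hypothetical stand-in for rap$Artista: three entries that print alike.
# The trailing characters are assumed examples, not taken from the real page.
artists <- c("Eminem", "Eminem\u00a0", "Eminem\u200b")

# Names that look equal but have different lengths carry invisible characters
n_chars <- nchar(artists)
print(n_chars)

# By default trimws() strips only space, tab, CR and LF, so nothing changes here
n_chars_trimmed <- nchar(trimws(artists))
print(n_chars_trimmed)

# Show the Unicode code point of the last character of each entry
last_code <- vapply(artists,
                    function(x) tail(utf8ToInt(x), 1),
                    integer(1),
                    USE.NAMES = FALSE)
print(last_code)
```

Once the offending code points are known, they can be removed explicitly or with a broader pattern.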

Here is a messy regular expression that keeps the leading run of letters, spaces, commas, ampersands and hyphens, and drops everything after it.

rap <- pagina[[2]]
rap <- rap[, -c(5)]

# keep the leading run of allowed characters and drop the rest
rap$Artista <- gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')

table(rap$Artista)


                Cardi B       Chance the Rapper                   Drake                  Eminem                     Jay              Kanye West          Kendrick Lamar
                      1                       1                       1                       6                       1                       4                       2
              Lil Wayne                Ludacris Macklemore & Ryan Lewis                     Nas       Naughty by Nature                 Outkast              Puff Daddy
                      1                       1                       1                       1                       1                       2                       1
             The Fugees      Tyler, the Creator
                      1                       2
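Note that "Jay" in the table is Jay-Z cut short. Inside the brackets, " -," is read as a range from space to comma, which does not include the hyphen itself. The sketch below compares that character class with a variant that puts the hyphen last so it is taken literally. The input string and the variable names are made up for illustration.

```r
# Hypothetical input resembling a scraped cell with a footnote marker
name <- "Jay-Z[1]"

# " -," inside the brackets is a range (space through comma), so "-" is not matched
range_class <- gsub("([a-zA-Z -,&]+).*", "\\1", name)

# With the hyphen placed last in the class it is matched literally
hyphen_last <- gsub("([a-zA-Z ,&-]+).*", "\\1", name)

print(range_class)
print(hyphen_last)
```

With the hyphen moved to the end of the class, hyphenated names should survive the cleanup.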

Here is another regular expression that seems a bit clearer:

gsub("[^a-zA-Z]*$", "", rap$Artista)

Anchored at the end of the string, it removes any run of zero or more trailing characters that are not a to z or A to Z.
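As a quick check of that pattern, here is a small sketch on made-up data. The vector `artists` and its invisible trailing characters are assumptions for illustration, not the actual scraped values.

```r
# Hypothetical names; some carry an invisible trailing character (assumed)
artists <- c("Eminem", "Eminem\u00a0", "Eminem\u200b",
             "Kanye West", "Kanye West\u00a0")

# Remove any run of non-letters anchored at the end of each string
cleaned <- gsub("[^a-zA-Z]*$", "", artists)

# The duplicates now collapse into a single entry per artist
counts <- table(cleaned)
print(counts)
```

One caveat: the pattern treats anything outside a to z and A to Z as removable, so a name that legitimately ends in a digit, a punctuation mark or an accented letter would lose that last character too.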
