Webscraping in R with rvest and str_extract from nested list

Time:03-18

Consider this as demo code:

library(rvest)
library(dplyr)
library(stringr)
library(tidyr)

## Import the current Allpax HTML pages
### List of all URLs
# The second URL contains double quotes, so it is wrapped in single quotes
df_allpax_page <- as.data.frame(c(
  "https://www.allpax.de/index.php/cat/c1059_Schlauchfolien.html",
  'https://www.allpax.de/index.php/cat/c999_Tisch-Folienschwei-geraet.html?Page=1&Items=200&Filter={"category":[999]}&sort=4&view=classic',
  "https://www.allpax.de/index.php/cat/c273_Durchlaufschwei-geraete-2kg-7kg-Beutel.html",
  "https://www.allpax.de/index.php/cat/c998_ALLPAX-Magnet-Folienschwei-ger.html"
))


##read files
files_list <- list()
for (j in 1:nrow(df_allpax_page)) {
  html_body <- read_html(df_allpax_page[j,1])
  files_list[[j]] <- html_body
}

body_list <- list()
for (i in 1:length(files_list)) {
  body_nodes <- files_list[[i]] %>% 
    html_node("body") %>% 
    html_children() %>% html_children()
  body_list[[i]] <- body_nodes
}

artikel_list <- list()
for (l in 1:length(body_list)) {
  list_nodes <- body_list[[l]] %>% 
    xml2::xml_find_all("//div[contains(@class, 'article-listitem')]") %>% 
    rvest::html_text()
  artikel_list[[l]] <- list_nodes
}


artikelnummer_liste <- list()
preis_liste <- list()
for (k in artikel_list) {
  for (m in k) {
    art_nr <- stringr::str_extract(artikel_list[m], "Art-Nr.{0,20}")
    preis <- stringr::str_extract(artikel_list[m], ".{0,6} €")
    artikelnummer_liste[[m]] <- art_nr
    preis_liste[[m]] <- preis
  }
}

Basically, all I want to do is extract information from artikel_list and store the results in the lists artikelnummer_liste and preis_liste.

The problem is that it does not extract the strings I am looking for and throws an error:

stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing 

Could somebody please help me?

Answer:

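The problem is in how the final loop indexes artikel_list, not in the regular expressions. The outer loop already iterates over the elements themselves, so inside the inner loop m is an article string rather than a position. artikel_list[m] (single brackets with a character index) then returns a list instead of a character vector, which is what triggers the "argument is not an atomic vector; coercing" warning. Because str_extract() is vectorised, a single loop over the positions with double brackets is enough. If you change it to: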

artikelnummer_liste <- list()
preis_liste         <- list()

# One iteration per page; artikel_list[[k]] is a character vector of article texts
for (k in seq_along(artikel_list)) {
  art_nr <- stringr::str_extract(artikel_list[[k]], "Art-Nr\\.(.{0,20})")
  preis  <- stringr::str_extract(artikel_list[[k]], ".{0,6} €")
  artikelnummer_liste[[k]] <- art_nr
  preis_liste[[k]]         <- preis
}

You get your results.
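
To see why the indexing matters without hitting the website, here is a minimal offline sketch. The article strings are made up for illustration (they only mimic the shape of what html_text() might return, with line breaks between fields), so treat the data as an assumption rather than real page content.

```r
library(stringr)

# Hypothetical stand-in for artikel_list: one character vector per page,
# one string per article.
artikel_list <- list(
  c("Folie A\nArt-Nr. 10000001;0\n1,50 €",
    "Folie B\nArt-Nr. 10000002;0\n28,30 €"),
  c("Geraet C\nArt-Nr. 10000003;0\n59,90 €")
)

# Single brackets keep the list wrapper; double brackets return the element.
is.atomic(artikel_list[1])    # FALSE: still a list of length 1
is.atomic(artikel_list[[1]])  # TRUE: a character vector of length 2

# Indexing by an article string (what the inner loop did) looks up an
# element with that name; the list is unnamed, so only NULL comes back.
wrong <- artikel_list[artikel_list[[1]][1]]

# str_extract() is vectorised, so one call handles a whole page.
art_nr <- str_extract(artikel_list[[1]], "Art-Nr\\.(.{0,20})")
preis  <- str_extract(artikel_list[[1]], ".{0,6} €")
```

With these sample strings, art_nr holds the two "Art-Nr. ..." fragments and preis the two prices, because the dot in the pattern does not match across line breaks by default.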

For completeness, you can put the final results in a nice list of data frames like this:

result <- Map(function(a, b) data.frame(artikelnummer = a, preis = b),
              a = artikelnummer_liste, b = preis_liste)

The result is pretty long, but looks like this:

result
#> [[1]]
#>         artikelnummer    preis
#> 1  Art-Nr. 10006846;0   1,50 €
#> 2  Art-Nr. 10001538;0  28,30 €
#> 3  Art-Nr. 10001539;0  29,40 €
#> 4  Art-Nr. 10001537;0  32,20 €
#> 5  Art-Nr. 10001540;0  33,00 €
#> 6  Art-Nr. 10001541;0  37,90 €
#> 7  Art-Nr. 10001543;0  38,50 €
#> 8  Art-Nr. 10001544;0  41,80 €
#> 9  Art-Nr. 10001545;0  46,60 €
#> 10 Art-Nr. 10001546;0  53,30 €
#> 11 Art-Nr. 10001547;0  57,20 €
#> 12 Art-Nr. 10001548;0  80,70 €
#> 13 Art-Nr. 10001549;0  95,50 €
#> 14 Art-Nr. 10001554;0  96,20 €
#> 15 Art-Nr. 10001550;0 100,20 €
#> 16 Art-Nr. 10001555;0 109,40 €
#> 17 Art-Nr. 10001551;0 114,40 €
#> 18 Art-Nr. 10001542;0 126,40 €
#> 19 Art-Nr. 10001556;0 135,60 €
#> 20 Art-Nr. 10001553;0 137,60 €
#> 21 Art-Nr. 10001552;0 143,20 €
#> 
#> [[2]]
#>         artikelnummer    preis
#> 1  Art-Nr. 10015386;0  59,90 €
#> 2  Art-Nr. 10015387;0  59,90 €
#> 3  Art-Nr. 10001474;0  60,00 €
#> 4  Art-Nr. 10001524;0  60,50 €
#> 5  Art-Nr. 10001525;0  65,50 €
#> 6  Art-Nr. 10001475;0  66,50 €
#> 7  Art-Nr. 10015389;0  72,50 €
#> 8  Art-Nr. 10015390;0  72,50 €
#> 9  Art-Nr. 10015454;0  73,80 €
#> 10 Art-Nr. 10001599;0 113,94 €
#> 11 Art-Nr. 10015391;0 114,50 €
#> 12 Art-Nr. 10015450;0 128,00 €
#> 13 Art-Nr. 10001651;0 142,50 €
#> 14 Art-Nr. 10015451;0 145,00 €
#> 15 Art-Nr. 10001602;0 173,05 €
#> 16 Art-Nr. 10015367;0 179,00 €
#> 17 Art-Nr. 10015392;0 188,00 €
#> 18 Art-Nr. 10015368;0 194,00 €
#> 19 Art-Nr. 10001604;0 217,24 €
#> 20 Art-Nr. 10001650;0 218,11 €
#> 21 Art-Nr. 10001526;0 242,00 €
#> 22 Art-Nr. 10002166;0 242,00 €
#> 23 Art-Nr. 10001528;0 308,00 €
#> 24 Art-Nr. 10002169;0 330,00 €
#> 25 Art-Nr. 10001527;0 354,00 €
#> 26 Art-Nr. 10001665;0 419,00 €
#> 27 Art-Nr. 10001471;0 460,00 €
#> 28 Art-Nr. 10015374;0 725,00 €
#> 29 Art-Nr. 10001472;0 814,00 €
#> 30 Art-Nr. 10001473;0 883,00 €
#> 31 Art-Nr. 10015371;0 285,00 €
#> 32 Art-Nr. 10015376;0 395,00 €
#> 33 Art-Nr. 10015455;0 528,00 €
#> 34 Art-Nr. 10015373;0 685,00 €
#> 
#> [[3]]
#>        artikelnummer    preis
#> 1 Art-Nr. 10001499;0 645,00 €
#> 2 Art-Nr. 10016124;0 655,69 €
#> 3 Art-Nr. 10016123;0 659,00 €
#> 4 Art-Nr. 10008196;0 280,77 €
#> 5 Art-Nr. 10006044;0 306,00 €
#> 6 Art-Nr. 10006045;0 306,00 €
#> 7 Art-Nr. 10010473;0 871,00 €
#> 8 Art-Nr. 10007346;0 390,00 €
#> 
#> [[4]]
#>         artikelnummer    preis
#> 1  Art-Nr. 10001669;0  42,70 €
#> 2  Art-Nr. 10001536;0  83,80 €
#> 3  Art-Nr. 10001560;0  93,20 €
#> 4  Art-Nr. 10001641;0 110,00 €
#> 5  Art-Nr. 10001559;0 115,00 €
#> 6  Art-Nr. 10001533;0 233,00 €
#> 7  Art-Nr. 10001534;0 272,00 €
#> 8  Art-Nr. 10001608;0 393,29 €
#> 9  Art-Nr. 10001607;0 563,07 €
#> 10 Art-Nr. 10001529;0 674,00 €
#> 11 Art-Nr. 10001577;0 787,00 €
#> 12 Art-Nr. 10001530;0 879,00 €
#> 13 Art-Nr. 10001531;0 916,00 €
#> 14 Art-Nr. 10001570;0 980,00 €
#> 15 Art-Nr. 10001532;0 370,00 €
#> 16 Art-Nr. 10001666;0 833,00 €
#> 17 Art-Nr. 10001667;0 397,00 €

Created on 2022-03-17 by the reprex package (v2.0.1)
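
As an optional follow-up, the list of data frames could be collapsed into one table with a page index, a bare article number, and a numeric price. This is only a sketch: the small result list below is a stand-in built from a few of the rows printed above, the column names page and preis_eur are my own choice, and the price conversion assumes values with a decimal comma and no thousands separator.

```r
library(dplyr)
library(stringr)

# Stand-in for `result`: a list with one data frame per page.
result <- list(
  data.frame(artikelnummer = c("Art-Nr. 10006846;0", "Art-Nr. 10001538;0"),
             preis = c("1,50 €", "28,30 €")),
  data.frame(artikelnummer = "Art-Nr. 10015386;0", preis = "59,90 €")
)

combined <- bind_rows(result, .id = "page") %>%
  mutate(
    # Keep only the digits that directly follow "Art-Nr. "
    artikelnummer = str_extract(artikelnummer, "(?<=Art-Nr\\. )\\d+"),
    # Drop the currency suffix and switch the decimal comma to a point
    preis_eur = as.numeric(str_replace(str_remove(preis, " €"), ",", "."))
  )
```

The page column records which list element each row came from, so the per-page grouping is not lost. Prices of 1.000 € or more would need extra handling, both in the conversion and in the ".{0,6} €" pattern itself.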
