Home > Mobile >  Scrape a list of webpages and put them toegether as dataframe
Scrape a list of webpages and put them toegether as dataframe

Time:09-26

I have an overview page of student statistics https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/index.html and each semester has specific information in a html table element, e.g. https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/2013-Wintersemester/index.html

I would like to scrape all information and put it together as a dataframe. I manually created a char vector of all URLs (perhaps there is another way).

Edit As was mentioned, some URL parts are capitalized, some are not. This list should be complete.

winters <- seq(from=2013, to=2021)
summers <- seq(from=2014, to=2022)
winters <- paste0(winters, "-wintersemester")
summers <- paste0(summers, "-Sommersemester")

all_terms <- c(rbind(winters, summers))
all_terms[1] <- "2013-Wintersemester"
all_terms[3] <- "2014-Wintersemester"

all_url <- paste0("https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/", all_terms, "/index.html")

I can get data for a single page

all_url[1] %>%
  read_html() %>% 
  html_table() %>% 
  as.data.frame()

Studierende gesamt  6645            
weiblich    4206            
männlich    2439            
Deutsche    5001            
Ausländer/innen 1644            
1. Fachsemester 1783            
1. Hochschulsemester    1110    

But fail to write a for loop.

tables <- list()
index <- 1

for(i in length(all_url)){
  table <- all_url[i] %>% 
    read_html() %>% 
    html_table()

  tables[index] <- table

  index <- index   1
}

df <- do.call("rbind", tables)

It would be great to have a dataframe with each sub-page (semester / year) as rows and all student data as columns.

CodePudding user response:

Some appear not to be available. You could solve this using tryCatch and substitute with NA.

library(rvest)
tables <- lapply(all_url, \(x) tryCatch(as.data.frame(html_table(read_html(x))),
                                        error=\(e) NA)) |> setNames(all_terms)

tail(tables, 3)
# $`2021-Sommersemester`
#                     X1   X2
# 1   Studierende gesamt 5131
# 2             weiblich 3037
# 3             männlich 2054
# 4             Deutsche 3698
# 5      Ausländer/innen 1433
# 6      1. Fachsemester  394
# 7 1. Hochschulsemester  143
# 
# $`2021-Wintersemester`
# [1] NA
# 
# $`2022-Sommersemester`
#                     X1   X2
# 1   Studierende gesamt 4851
# 2             weiblich 2847
# 3             männlich 2004
# 4             Deutsche 3360
# 5      Ausländer/innen 1491
# 6      1. Fachsemester  403
# 7 1. Hochschulsemester  189

Thereafter you may want to rbind the non-missings,

na <- is.na(tables)
tables[!na] <- Map(`[<-`, tables[!na], 'sem', value=substr(all_terms[!na], 1, 6))  ## add year column*
res <- do.call(rbind, tables[!is.na(tables)])
head(res)
#                                       X1   X2    sem
# 2013-Wintersemester.1 Studierende gesamt 6645 2013-W
# 2013-Wintersemester.2           weiblich 4206 2013-W
# 2013-Wintersemester.3           männlich 2439 2013-W
# 2013-Wintersemester.4           Deutsche 5001 2013-W
# 2013-Wintersemester.5    Ausländer/innen 1644 2013-W
# 2013-Wintersemester.6    1. Fachsemester 1783 2013-W

*better use sapply(strsplit(substr(all_terms[!na], 1, 6), '-'), \(x) paste(rev(x), collapse='_')) here to get valid names

and reshape the data.

reshape2::dcast(res, X1 ~ sem, value.var='X2')
#                     X1 2013-W 2014-S 2014-W 2015-S 2016-S 2017-S 2018-S 2019-S 2020-S 2021-S 2022-S
# 1      1. Fachsemester   1783    567   1600    557    613    693    810    611    405    394    403
# 2 1. Hochschulsemester   1110    199   1020    224    240    217    273    214     78    143    189
# 3      Ausländer/innen   1644   1510   1649   1501   1576   1613   1551   1527   1369   1433   1491
# 4             Deutsche   5001   4836   4843   4599   4682   4733   4821   4523   4040   3698   3360
# 5             männlich   2439   2347   2394   2255   2292   2388   2468   2388   2197   2054   2004
# 6   Studierende gesamt   6645   6346   6492   6100   6258   6346   6372   6051   5409   5131   4851
# 7             weiblich   4206   3999   4098   3845   3966   3958   3904   3663   3212   3037   2847

CodePudding user response:

Here's a tidyverse approach. Note that I only use links 1:4 because there's something off with (some of) the others.

library(rvest)
library(tidyverse)

# gather terms and corresponding urls
data <- tibble(
  term = all_terms,
  url =  paste0(
   "https://www.europauni.de/de/struktur/verwaltung/dezernat_1/statistiken/", 
    all_terms, "/index.html"
  )
)[1:4,]

# map over `data`, scrape table, rename variables and bind the results
map(1:nrow(data), ~ {
  data$url[.x] %>%
    read_html() %>% 
    html_table() %>%
    `[[`(., 1) %>% 
    mutate(term = data$term[.x]) %>%
    rename(Kategorie = X1, Anzahl = X2)
}) %>%
  bind_rows()

Result:

# A tibble: 28 × 3
   Kategorie            Anzahl term               
   <chr>                 <int> <chr>              
 1 Studierende gesamt     6645 2013-Wintersemester
 2 weiblich               4206 2013-Wintersemester
 3 männlich               2439 2013-Wintersemester
 4 Deutsche               5001 2013-Wintersemester
 5 Ausländer/innen        1644 2013-Wintersemester
 6 1. Fachsemester        1783 2013-Wintersemester
 7 1. Hochschulsemester   1110 2013-Wintersemester
 8 Studierende gesamt     6346 2014-Sommersemester
 9 weiblich               3999 2014-Sommersemester
10 männlich               2347 2014-Sommersemester
...
  • Related