I'm trying to scrape the WHOLE 'In more languages' table on Wikidata pages, e.g.
However, all this means is that we need to look for a different URL that serves the full table. Chrome's developer tools show that the table is loaded from https://www.wikidata.org/wiki/Special:EntityData/Q3044.json, and that's the page we actually want to scrape. If we download it using jsonlite we don't get the table directly, but we can reassemble it with a few dplyr functions. Here's a snippet of code that does that:
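library(dplyr)

# Download and parse the entity's JSON (nested lists, not a data frame)
wiki_data <- jsonlite::read_json("https://www.wikidata.org/wiki/Special:EntityData/Q3044.json")
table_data <- wiki_data$entities$Q3044

# One row per language for labels and descriptions
label_col <- bind_rows(table_data$labels) %>% rename(label = value)
desc_col <- bind_rows(table_data$descriptions) %>% rename(description = value)

# A language can have several aliases, so collapse them into one string
alias_col <- bind_rows(table_data$aliases) %>%
  rename(alias = value) %>%
  group_by(language) %>%
  summarise(alias = paste(alias, collapse = ", "))

# Join everything on the language code
full_table <- label_col %>%
  left_join(desc_col, by = "language") %>%
  left_join(alias_col, by = "language")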
The first few rows of the output are shown below:
> full_table
# A tibble: 157 x 4
  language label       description                                        alias
  <chr>    <chr>       <chr>                                              <chr>
1 fr       Charlemagne empereur d'Occident et roi des Francs              Char~
2 en       Charlemagne King of the Franks, King of Italy, and Holy Roman~ Karo~
3 it       Carlo Magno re dei Franchi e dei Longobardi e primo imperator~ NA
4 ilo      Karlomagno  Ari dagiti Pranko ken Lombardo ken Emperador ti N~ NA
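
If you need this for more than one item, the same steps could be wrapped in a function. What follows is only a rough sketch: build_language_table is a hypothetical helper name (not part of jsonlite or dplyr), and mock_entity is a small hand-built stand-in for wiki_data$entities$Q3044 with placeholder descriptions and aliases, so the example runs without a network call. It assumes the entity has at least one label, one description and one alias; an entity missing any of those would need extra handling.

```r
library(dplyr)

# Hypothetical helper: turn one parsed Wikidata entity (nested lists, as
# returned by jsonlite::read_json) into a language/label/description/alias table.
build_language_table <- function(entity) {
  # labels and descriptions hold one list(language, value) per language
  label_col <- bind_rows(unname(entity$labels)) %>% rename(label = value)
  desc_col <- bind_rows(unname(entity$descriptions)) %>% rename(description = value)

  # aliases hold a list of list(language, value) per language,
  # so flatten one level before binding the rows
  alias_col <- bind_rows(unname(unlist(entity$aliases, recursive = FALSE))) %>%
    rename(alias = value) %>%
    group_by(language) %>%
    summarise(alias = paste(alias, collapse = ", "))

  label_col %>%
    left_join(desc_col, by = "language") %>%
    left_join(alias_col, by = "language")
}

# Hand-built stand-in for wiki_data$entities$Q3044 (placeholder values,
# NOT real Wikidata content beyond the three labels)
mock_entity <- list(
  labels = list(
    fr = list(language = "fr", value = "Charlemagne"),
    en = list(language = "en", value = "Charlemagne"),
    it = list(language = "it", value = "Carlo Magno")
  ),
  descriptions = list(
    fr = list(language = "fr", value = "placeholder description (fr)"),
    en = list(language = "en", value = "placeholder description (en)")
  ),
  aliases = list(
    en = list(
      list(language = "en", value = "Charles the Great"),
      list(language = "en", value = "Charles I")
    )
  )
)

mock_table <- build_language_table(mock_entity)
print(mock_table)
```

With the real data from above, the call would be build_language_table(wiki_data$entities$Q3044). Languages that have a label but no description or alias keep their row and get NA in the missing columns, because both joins are left joins on language.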