How to scrape 'In more languages' table on Wikidata?

Time:05-28

I'm trying to scrape the whole 'In more languages' table on Wikidata pages, but when the page refreshes, only the first row of the table is initially loaded.

However, all this means is that we need to look for a different URL that's sourcing the full table. Using Chrome's developer tools, we learn that the table comes from https://www.wikidata.org/wiki/Special:EntityData/Q3044.json, and that's the resource we actually want to scrape. If we download it with jsonlite, we don't get the table directly, but we can reassemble it with some dplyr tools. Here's a snippet of code that does that:


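library(dplyr)

# Download the entity as a nested list; the language data sits under entities$Q3044
wiki_data <- jsonlite::read_json("https://www.wikidata.org/wiki/Special:EntityData/Q3044.json")
table_data <- wiki_data$entities$Q3044

# One row per language for labels and descriptions
label_col <- bind_rows(table_data$labels) %>% rename(label = value)
desc_col <- bind_rows(table_data$descriptions) %>% rename(description = value)

# A language can have several aliases, so collapse them into one string per language
alias_col <- bind_rows(table_data$aliases) %>%
  rename(alias = value) %>%
  group_by(language) %>%
  summarise(alias = paste(alias, collapse = ", "))

# Join on language; languages without an alias get NA
full_table <- label_col %>%
  left_join(desc_col, by = "language") %>%
  left_join(alias_col, by = "language")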

with the first few rows of the output shown below:

> full_table
# A tibble: 157 x 4
   language label                         description                                        alias
   <chr>    <chr>                         <chr>                                              <chr>
 1 fr       Charlemagne                   empereur d'Occident et roi des Francs              Char~
 2 en       Charlemagne                   King of the Franks, King of Italy, and Holy Roman~ Karo~
 3 it       Carlo Magno                   re dei Franchi e dei Longobardi e primo imperator~ NA   
 4 ilo      Karlomagno                    Ari dagiti Pranko ken Lombardo ken Emperador ti N~ NA   
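
To reuse the same reassembly for other entities, the steps can be wrapped in a small function. The sketch below is only an illustration: `entity_url()` and `language_table()` are hypothetical helper names, the URL pattern is assumed from the single Q3044 address above, and `mock_entity` is made-up placeholder data that mimics the nested shape `jsonlite::read_json()` returns, so the example runs without a network connection. It flattens the alias lists explicitly with `unlist(..., recursive = FALSE)` instead of relying on `bind_rows()` to do so, and it assumes the entity has at least one alias.

```r
library(dplyr)

# Hypothetical helper: build the EntityData URL for an entity ID such as "Q3044".
# The pattern is assumed from the one URL shown in the answer.
entity_url <- function(id) {
  sprintf("https://www.wikidata.org/wiki/Special:EntityData/%s.json", id)
}

# Hypothetical helper: rebuild the language table from one parsed entity,
# i.e. the list found at wiki_data$entities[[id]].
language_table <- function(entity) {
  label_col <- bind_rows(entity$labels) %>% rename(label = value)
  desc_col <- bind_rows(entity$descriptions) %>% rename(description = value)

  # aliases is a list per language of list(language, value) records;
  # flatten one level so every record becomes one row.
  alias_col <- bind_rows(unlist(entity$aliases, recursive = FALSE)) %>%
    group_by(language) %>%
    summarise(alias = paste(value, collapse = ", "))

  label_col %>%
    left_join(desc_col, by = "language") %>%
    left_join(alias_col, by = "language")
}

# Made-up placeholder data with the same nesting as the parsed JSON.
mock_entity <- list(
  labels = list(
    en = list(language = "en", value = "label-en"),
    it = list(language = "it", value = "label-it")
  ),
  descriptions = list(
    en = list(language = "en", value = "desc-en"),
    it = list(language = "it", value = "desc-it")
  ),
  aliases = list(
    en = list(
      list(language = "en", value = "alias-1"),
      list(language = "en", value = "alias-2")
    )
  )
)

mock_table <- language_table(mock_entity)
print(mock_table)
```

Against the live service, the equivalent call would be `language_table(jsonlite::read_json(entity_url("Q3044"))$entities$Q3044)`.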