Get metadata on csv file in a github repo with R? (i.e., file.info but for online files)

Time:10-05

Is there a simple, non-API R command or function to get basic metadata on a CSV file in a GitHub repository? I especially need (1) the date of the last commit and (2) the size in bytes, both of which I'm trying to pull into an RMarkdown document.

Here is an example file

CodePudding user response:

I don't know of a simple function to do this, but you can write a little web scraping function with rvest to do the job:

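library(rvest)

# Scrape basic metadata for a file from its GitHub "blob" page.
# Note: this relies on GitHub's page markup at the time of writing.
file_metadata <- function(url) {
  
  page <- read_html(url)
  
  # File name is the last path component of the URL
  file <- tail(strsplit(url, "/")[[1]], 1)
  # Class string of the div that holds the line count and file size
  div1 <- "text-mono f6 flex-auto pr-3 flex-order-2 flex-md-order-1"
  
  # Split the div's text into trimmed lines; the size label is the fifth
  size <- page %>%
    html_elements(xpath = paste0("//div[@class='", div1, "']")) %>%
    html_text() %>%
    strsplit("\n") %>%
    sapply(trimws) %>%
    getElement(5)
  
  # The <relative-time> element carries the commit timestamp in `datetime`
  last_commit <- page %>% 
    html_elements("relative-time") %>% 
    html_attr("datetime") %>%
    as.POSIXct()
  
  data.frame(file, size, last_commit)
}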
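
The `datetime` attribute read above is an ISO 8601 string. As a minimal sketch, assuming a value of the form `2022-01-18T15:32:10Z` (the time of day here is made up for illustration and is not the real commit time), passing an explicit format and time zone keeps the time component instead of relying on the default format guessing in `as.POSIXct()`, which may differ between R versions:

```r
# Hypothetical timestamp in the ISO 8601 form used by the datetime attribute
stamp <- "2022-01-18T15:32:10Z"

# Parse with an explicit format; the trailing "Z" is ignored by the format,
# so the time zone is stated explicitly as UTC
parsed <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
#> [1] "2022-01-18 15:32:10"
```

If only the date is needed in the RMarkdown document, `as.Date(parsed)` would drop the time again.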

Testing it on your example file url, we have:

file_metadata(example_file)
#>                  file    size last_commit
#> 1 EB_data_example.csv 1.32 KB  2022-01-18

Created on 2022-10-04 with reprex v2.0.2
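
The question asks for the size in bytes, whereas the function above returns the label shown on the page (for example "1.32 KB"). A rough conversion could look like the sketch below. `size_to_bytes` is a hypothetical helper name, and the sketch assumes the label uses binary multiples (1 KB = 1024 bytes); because the displayed value is rounded, the result is only an approximation of the true file size.

```r
# Convert a size label such as "1.32 KB" to an approximate byte count.
# Assumption: units are binary multiples (1 KB = 1024 bytes).
size_to_bytes <- function(size_label) {
  # Split "1.32 KB" into the numeric part and the unit
  parts <- strsplit(trimws(size_label), "\\s+")[[1]]
  value <- as.numeric(parts[1])
  unit  <- toupper(parts[2])

  multipliers <- c(BYTES = 1, B = 1, KB = 1024, MB = 1024^2, GB = 1024^3)
  if (is.na(value) || !unit %in% names(multipliers)) {
    stop("Unrecognised size label: ", size_label)
  }

  # Round because the label itself is already a rounded figure
  round(value * multipliers[[unit]])
}

size_to_bytes("1.32 KB")
#> [1] 1352
```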


Example file URL in full:

example_file <- paste0("https://github.com/BrunaLab/LAS6292_DataManagement/",
                       "blob/4b856c2fad350edaded78fba671023b8c544b1dd/",
                       "static/course-materials/class-sessions/03-spreadsheets/examples/",
                       "EB_data_example.csv")