Is there a simple, non-API R command or function to get the basic metadata for a CSV file in a GitHub repository? I especially need (1) the date of the last commit and (2) the size in bytes, which I'm trying to pull into an RMarkdown document.
Here is an example file
CodePudding user response:
I don't know of a simple function to do this, but you can write a little web scraping function with rvest to do the job:
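library(rvest)

file_metadata <- function(url) {
  page <- read_html(url)

  # The file name is the last component of the URL
  file <- tail(strsplit(url, "/")[[1]], 1)

  # Class of the div that holds the line count and file size
  div1 <- "text-mono f6 flex-auto pr-3 flex-order-2 flex-md-order-1"

  # Split the div's text into lines and take the fifth one, the size string
  size <- page %>%
    html_elements(xpath = paste0("//div[@class='", div1, "']")) %>%
    html_text() %>%
    strsplit("\n") %>%
    sapply(trimws) %>%
    getElement(5)

  # The last-commit time is the datetime attribute of <relative-time>
  last_commit <- page %>%
    html_elements("relative-time") %>%
    html_attr("datetime") %>%
    as.POSIXct()

  data.frame(file, size, last_commit)
}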
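If you need the time of day as well as the date, note that as.POSIXct() without an explicit format may keep only the date part of the string, depending on your R version. Here is a minimal sketch of a stricter parser, assuming the datetime attribute is an ISO 8601 UTC string such as "2022-01-18T15:30:00Z". The helper name parse_github_time is hypothetical, and the timestamp in the usage line is made up for illustration; it is not the real commit time.

```r
# Hypothetical helper: parse a timestamp of the assumed form
# "YYYY-MM-DDTHH:MM:SSZ" into a POSIXct value in UTC.
parse_github_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Made-up timestamp, used only to show the call
commit_time <- parse_github_time("2022-01-18T15:30:00Z")
format(commit_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")
#> [1] "2022-01-18 15:30:00"
```

Strings that do not match the format come back as NA rather than raising an error, so it is worth checking the result with is.na() before using it.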
Testing file_metadata() on your example file URL, we have:
file_metadata(example_file)
#>                  file    size last_commit
#> 1 EB_data_example.csv 1.32 KB  2022-01-18
Created on 2022-10-04 with reprex v2.0.2
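The size comes back as a human-readable string ("1.32 KB"), while the question asks for bytes. Here is a minimal sketch of a converter, assuming the page uses binary units (1 KB = 1024 bytes) and the unit labels listed in the code. Both are assumptions about GitHub's display, and size_to_bytes is a hypothetical helper name. Because the displayed value is already rounded, the result is only approximate.

```r
# Hypothetical helper: convert a size string such as "1.32 KB" to bytes.
# Assumes binary multiples (1 KB = 1024 bytes) and these unit labels.
size_to_bytes <- function(size) {
  units <- c(Byte = 1, Bytes = 1, KB = 1024, MB = 1024^2, GB = 1024^3)

  # Split "1.32 KB" into the number and the unit label
  parts <- strsplit(trimws(size), "\\s+")[[1]]
  value <- suppressWarnings(as.numeric(parts[1]))
  unit <- parts[2]

  if (length(parts) != 2 || is.na(value) || !unit %in% names(units)) {
    stop("Unrecognised size string: ", size)
  }

  round(value * units[[unit]])
}

size_to_bytes("1.32 KB")
#> [1] 1352
```

In the RMarkdown document you could then call size_to_bytes() on the size column of the data frame returned by file_metadata(). If you need the exact byte count, the rounded string on the page cannot provide it.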
The example file URL in full:
example_file <- paste0("https://github.com/BrunaLab/LAS6292_DataManagement/",
                       "blob/4b856c2fad350edaded78fba671023b8c544b1dd/",
                       "static/course-materials/class-sessions/03-spreadsheets/examples/",
                       "EB_data_example.csv")