How to extract the date from PDF file names to a data set?

Time:11-28

I am trying to extract the date from the file names of multiple PDFs to create a date column in a dataset.

I have a folder holding all the PDFs and want to run topic modelling over a time period, so I need to extract the dates.

Below is the dataset I have, which contains only the filenames.

# A tibble: 260 x 1
   filename        
   <chr>           
 1 ./2012.01.18.pdf
 2 ./2012.02.07.pdf
 3 ./2012.03.12.pdf
 4 ./2012.03.26.pdf
 5 ./2012.04.02.pdf
 6 ./2012.04.04.pdf
 7 ./2012.04.19.pdf
 8 ./2012.05.01.pdf
 9 ./2012.05.07.pdf
10 ./2012.06.14.pdf

I tried "as.Date" with no luck, because I am unable to extract the dates from the names of the files in the folder holding all the PDFs.

CodePudding user response:

You must first extract the date string from the name, then coerce to class "Date".

df1 <-'1 ./2012.01.18.pdf
 2 ./2012.02.07.pdf
 3 ./2012.03.12.pdf
 4 ./2012.03.26.pdf
 5 ./2012.04.02.pdf
 6 ./2012.04.04.pdf
 7 ./2012.04.19.pdf
 8 ./2012.05.01.pdf
 9 ./2012.05.07.pdf
10 ./2012.06.14.pdf'
df1 <- read.table(textConnection(df1))

# Keep only the yyyy.mm.dd part; the dots are escaped so they match literal dots only
df1$V2 <- sub(".*(\\d{4}\\.\\d{2}\\.\\d{2}).*", "\\1", df1$V2)
df1$V2 <- as.Date(df1$V2, "%Y.%m.%d")
df1
#>    V1         V2
#> 1   1 2012-01-18
#> 2   2 2012-02-07
#> 3   3 2012-03-12
#> 4   4 2012-03-26
#> 5   5 2012-04-02
#> 6   6 2012-04-04
#> 7   7 2012-04-19
#> 8   8 2012-05-01
#> 9   9 2012-05-07
#> 10 10 2012-06-14

Created on 2022-11-27 with reprex v2.0.2
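
The same two steps can be applied directly to a `filename` column like the one in the question. This is a minimal sketch, assuming a data frame called `files` (a hypothetical name) with a character column `filename`. It swaps the regular expression for `basename()` and `tools::file_path_sans_ext()`, which is an alternative to the `sub()` call above, not a restatement of it.

```r
# Hypothetical stand-in for the question's tibble of file names
files <- data.frame(
  filename = c("./2012.01.18.pdf", "./2012.02.07.pdf", "./2012.03.12.pdf"),
  stringsAsFactors = FALSE
)

# Step 1: drop the directory part and the extension, leaving e.g. "2012.01.18"
stem <- tools::file_path_sans_ext(basename(files$filename))

# Step 2: coerce the remaining string to class "Date"
files$date <- as.Date(stem, format = "%Y.%m.%d")

files
```

In practice the names could come from something like `list.files(pattern = "\\.pdf$", full.names = TRUE)` run in the folder that holds the PDFs, but that depends on how the folder is laid out.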

CodePudding user response:

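Alternatively, the format string passed to as.Date can include the extra literal characters ("./" and ".pdf") alongside the conversion specifications for year (%Y), month (%m) and day (%d), so no separate extraction step is needed.

df$V2 <- as.Date(df$V2, format = "./%Y.%m.%d.pdf")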

-output

> df
   V1         V2
1   1 2012-01-18
2   2 2012-02-07
3   3 2012-03-12
4   4 2012-03-26
5   5 2012-04-02
6   6 2012-04-04
7   7 2012-04-19
8   8 2012-05-01
9   9 2012-05-07
10 10 2012-06-14

data

df <- structure(list(V1 = 1:10, V2 = c("./2012.01.18.pdf", "./2012.02.07.pdf", 
"./2012.03.12.pdf", "./2012.03.26.pdf", "./2012.04.02.pdf", "./2012.04.04.pdf", 
"./2012.04.19.pdf", "./2012.05.01.pdf", "./2012.05.07.pdf", "./2012.06.14.pdf"
)), class = "data.frame", row.names = c(NA, -10L))
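
One thing to keep in mind with a literal format string is that names that do not follow the pattern come back as `NA` instead of raising an error. A small sketch of how that could be checked, using a made-up input vector (`names_in` and the file `./notes.pdf` are hypothetical, added only for illustration):

```r
# Hypothetical input: two well-formed names and one that does not follow the pattern
names_in <- c("./2012.01.18.pdf", "./notes.pdf", "./2012.06.14.pdf")

# Names that do not match the format are returned as NA rather than raising an error
parsed <- as.Date(names_in, format = "./%Y.%m.%d.pdf")

# Collect the file names that failed to parse so they can be inspected
bad <- names_in[is.na(parsed)]
print(bad)
```

Checking `is.na()` after the conversion is a cheap way to catch stray files before they end up as missing dates in the topic model's time axis.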