I am trying to extract the date from the filenames of multiple PDFs to create a date column in a dataset.
I have a folder holding all the PDFs and want to do topic modelling over a time period, hence I need to extract the dates.
Below is the dataset I have, which contains only the filenames.
# A tibble: 260 x 1
filename
<chr>
1 ./2012.01.18.pdf
2 ./2012.02.07.pdf
3 ./2012.03.12.pdf
4 ./2012.03.26.pdf
5 ./2012.04.02.pdf
6 ./2012.04.04.pdf
7 ./2012.04.19.pdf
8 ./2012.05.01.pdf
9 ./2012.05.07.pdf
10 ./2012.06.14.pdf
I tried "as.Date" with no luck: I am unable to extract the dates from the folder holding all the PDFs.
CodePudding user response:
You must first extract the date string from the name, then coerce it to class "Date".
df1 <-'1 ./2012.01.18.pdf
2 ./2012.02.07.pdf
3 ./2012.03.12.pdf
4 ./2012.03.26.pdf
5 ./2012.04.02.pdf
6 ./2012.04.04.pdf
7 ./2012.04.19.pdf
8 ./2012.05.01.pdf
9 ./2012.05.07.pdf
10 ./2012.06.14.pdf'
df1 <- read.table(textConnection(df1))
# Escape the dots so they match a literal "." rather than any character
df1$V2 <- sub(".*(\\d{4}\\.\\d{2}\\.\\d{2}).*", "\\1", df1$V2)
df1$V2 <- as.Date(df1$V2, "%Y.%m.%d")
df1
#> V1 V2
#> 1 1 2012-01-18
#> 2 2 2012-02-07
#> 3 3 2012-03-12
#> 4 4 2012-03-26
#> 5 5 2012-04-02
#> 6 6 2012-04-04
#> 7 7 2012-04-19
#> 8 8 2012-05-01
#> 9 9 2012-05-07
#> 10 10 2012-06-14
Created on 2022-11-27 with reprex v2.0.2
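The same two steps (extract the date string, then coerce it) can be sketched without a regular expression, working directly on a `filename` column like the one in the question. This is only a sketch: the `files` data frame below is a hypothetical stand-in for the question's tibble, and it assumes every name has the form `./YYYY.MM.DD.pdf`.

```r
# Hypothetical stand-in for the question's tibble; only the column name
# `filename` and the sample values are taken from the post.
files <- data.frame(
  filename = c("./2012.01.18.pdf", "./2012.02.07.pdf", "./2012.03.12.pdf"),
  stringsAsFactors = FALSE
)

# Step 1: drop the directory part and the ".pdf" extension,
# leaving strings such as "2012.01.18".
stem <- tools::file_path_sans_ext(basename(files$filename))

# Step 2: coerce to class "Date"; strings that do not fit the format become NA.
files$date <- as.Date(stem, format = "%Y.%m.%d")
files
```

Using basename() means the result does not depend on the folder prefix, which may be useful if the paths ever come from somewhere other than "./".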
CodePudding user response:
In the format argument, we could specify the extra characters along with the custom format for year (%Y), month (%m) and day (%d):
df$V2 <- as.Date(df$V2, format = "./%Y.%m.%d.pdf")
-output
> df
V1 V2
1 1 2012-01-18
2 2 2012-02-07
3 3 2012-03-12
4 4 2012-03-26
5 5 2012-04-02
6 6 2012-04-04
7 7 2012-04-19
8 8 2012-05-01
9 9 2012-05-07
10 10 2012-06-14
data
df <- structure(list(V1 = 1:10, V2 = c("./2012.01.18.pdf", "./2012.02.07.pdf",
"./2012.03.12.pdf", "./2012.03.26.pdf", "./2012.04.02.pdf", "./2012.04.04.pdf",
"./2012.04.19.pdf", "./2012.05.01.pdf", "./2012.05.07.pdf", "./2012.06.14.pdf"
)), class = "data.frame", row.names = c(NA, -10L))
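Because the format string above includes the literal "./" prefix, a name that does not follow the pattern is returned as NA rather than raising an error. A minimal sketch of checking for such cases follows; the `filenames` vector is hypothetical, and "./notes.pdf" is an invented entry added only to show the failure case.

```r
# Hypothetical input: two names from the question plus one invented
# name that does not follow the YYYY.MM.DD pattern.
filenames <- c("./2012.01.18.pdf", "./2012.02.07.pdf", "./notes.pdf")

# Parse with the literal characters embedded in the format string.
dates <- as.Date(filenames, format = "./%Y.%m.%d.pdf")

# Names that could not be parsed come back as NA; list them for inspection.
unparsed <- filenames[is.na(dates)]
unparsed
```

Checking `unparsed` before building the date column is a cheap way to catch stray files in the folder that would otherwise end up with missing dates.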