I'm working on a problem where I need to merge two datasets. The first dataset is from SQL and imported using the RODBC library, while the second dataset is imported from Excel. I want to merge the two dataframes by month and year, however in order to do that, I need to convert the first DF's date column into year-month, from year-month-date.
I have tried to use as.Date(df$postingdate, format = '%Y %M'
or strftime(df$postingdate,"%Y %m")
as I normally would do, however the first doesn't work and the second changes the column to character. It has been a problem for days, and I have tried a number of things, mainly suggestions from the following link: [https://stackoverflow.com/questions/6242955/converting-year-and-month-yyyy-mm-format-to-a-date][1]
In the bottom I have created a df from output I get when using `dput()´ (df2) and I noticed that under posting date, the data is converted to a number, rather than the actual date (“2020-05-28”, “2020-10-09”, "2021-10-19"). Therefor I’m also unsure whatever I have problem because I use the wrong functions, or because the data is of a “unknown” data type.
A sample of the first dataset where I want to transform date into year – month:
df <- data.frame(
Posting_Date = c("2020-05-28", "2020-10-09", "2021-10-19"), Sales = c(2702.5, 369, 4134),
Sales_person_code = c(6L, 10L, 10L), EDI = c(1L, 1L, 1L),
City = c(141L, 4L, 6L), Kæde = c(12L, 12L, 12L),
Vinter = c(0, 0, 0), Forår = c(1, 0, 0), Sommer = c(0, 0, 0),
Efterår = c(0, 1, 1), Fredag = c(0, 1, 0), Lørdag = c(0, 0, 0),
Mandag = c(0, 0, 0), Onsdag = c(0, 0, 0), Søndag = c(0, 0, 0),
Tirsdag = c(0, 0, 1), Torsdag = c(1, 0, 0),
year_month = c("2020-05-28", "2020-10-09", "2021-10-19"))
df2 <- data.frame(
Posting_Date = c(18410, 18544, 18919), Sales = c(2702.5, 369, 4134),
Sales_person_code = c(6L, 10L, 10L),EDI = c(1L, 1L, 1L),
City = c(141L, 4L, 6L), Kæde = c(12L, 12L, 12L),
Vinter = c(0, 0, 0), Forår = c(1, 0, 0), Sommer = c(0, 0, 0),
Efterår = c(0, 1, 1), Fredag = c(0, 1, 0), Lørdag = c(0, 0, 0),
Mandag = c(0, 0, 0), Onsdag = c(0, 0, 0), Søndag = c(0, 0, 0),
Tirsdag = c(0, 0, 1), Torsdag = c(1, 0, 0),
year_month = c(18410, 18544, 18919))
Thanks in advance for any help. Plz let me know if i can do anything to help you guys, helping me
CodePudding user response:
Up front, your attempt of as.Date(df$Posting_Date, format="%Y %m")
seems backwards: the function as.Date
is for converting from a string to a Date
-class, and its format=
argument is to identify how to find the year/month/day components of the string, not how you want to convert it later. (Note that in R, a Date
is shown as YYYY-MM-DD
. Always. Telling R you want a date to be just year/month is saying that you want to convert it to a string, no longer date-like or number-like. lubridate
and perhaps other packages allow you to have similar-to-Date
like objects.)
For df
, one can just subset the strings without parsing to Date
-class:
substring(df$Posting_Date, 1, 7)
# [1] "2020-05" "2020-10" "2021-10"
If you want to do anything number-like to them, you can convert to Date
-class first, and then use format(.)
to convert to a string with a specific format.
as.Date(df$Posting_Date)
# [1] "2020-05-28" "2020-10-09" "2021-10-19"
format(as.Date(df$Posting_Date), format = "%Y-%m")
# [1] "2020-05" "2020-10" "2021-10"
For df2
, though, since it is numeric you need to specify an origin=
instead of a format=
. I'm inferring that these are based off of epoch, so
as.Date(df2$Posting_Date, origin = "1970-01-01")
# [1] "2020-05-28" "2020-10-09" "2021-10-19"
format(as.Date(df2$Posting_Date, origin = "1970-01-01"), format = "%Y-%m")
# [1] "2020-05" "2020-10" "2021-10"
Note that R stores Date
(and POSIXct
, incidentally) as numbers internally:
dput(as.Date(df2$Posting_Date, origin = "1970-01-01"))
# structure(c(18410, 18544, 18919), class = "Date")