I want to generate a random day of the month that depends on the month of the year. My current code is:
df$new_day = case_when(
df$new_month == 2 ~ (floor(runif(1, min=1, max=28))),
df$new_month == 1 ~ floor(runif(1, min=1, max=31)),
df$new_month == 3 ~ floor(runif(1, min=1, max=31)),
df$new_month == 5 ~ floor(runif(1, min=1, max=31)),
df$new_month == 7 ~ floor(runif(1, min=1, max=31)),
df$new_month == 8 ~ floor(runif(1, min=1, max=31)),
df$new_month == 10 ~ floor(runif(1, min=1, max=31)),
df$new_month == 12 ~ floor(runif(1, min=1, max=31)),
TRUE ~ floor(runif(1, min=1, max=30))
)
However, every row for a given month gets the same day. For instance, all the dates in February come out as 23.
How can I truly randomize the day within each month?
CodePudding user response:
You are explicitly asking for one random number each time: runif(1, ...). Instead, use runif(n(), ...). Note that the right-hand side isn't evaluated once per row; it is evaluated once for the whole frame, and the result is then applied to every row that meets the condition. In my example below, there are three rows in May, but runif is called as runif(1, ...) and that single number is applied to all three rows.
Sample data:
library(dplyr)
set.seed(42)
df <- data.frame(day = as.Date("2022-01-01") + sample(364, size=10)) %>%
  arrange(day) %>%
  mutate(month = as.POSIXlt(day)$mon + 1L)
df
# day month
# 1 2022-02-19 2
# 2 2022-03-16 3
# 3 2022-05-03 5
# 4 2022-05-09 5
# 5 2022-05-27 5
# 6 2022-06-03 6
# 7 2022-08-17 8
# 8 2022-10-31 10
# 9 2022-11-18 11
# 10 2022-12-31 12
Broken:
library(dplyr)
set.seed(42)
df %>%
mutate(
new_day = case_when(
month == 2 ~ floor(runif(1, 1, 28)),
month %in% c(9, 4, 6, 11) ~ floor(runif(1, 1, 30)),
TRUE ~ floor(runif(1, 1, 31))
)
)
# day month new_day
# 1 2022-02-19 2 25
# 2 2022-03-16 3 9
# 3 2022-05-03 5 9
# 4 2022-05-09 5 9
# 5 2022-05-27 5 9
# 6 2022-06-03 6 28
# 7 2022-08-17 8 9
# 8 2022-10-31 10 9
# 9 2022-11-18 11 28
# 10 2022-12-31 12 9
To demonstrate that runif is being called once for the whole frame rather than once per matching row, I'll add message to each branch. If runif(1, ...) were called once per matching row, we should see "31d" printed to the console seven times and "30d" twice, but we don't.
set.seed(42)
df %>%
mutate(
new_day = case_when(
month == 2 ~ { message("Feb: ", length(month)); floor(runif(1, 1, 28)); },
month %in% c(9, 4, 6, 11) ~ { message("30d: ", length(month)); floor(runif(1, 1, 30)); },
TRUE ~ { message("31d: ", length(month)); floor(runif(1, 1, 31)); }
)
)
# Feb: 10
# 30d: 10
# 31d: 10
# day month new_day
# 1 2022-02-19 2 25
# 2 2022-03-16 3 9
# 3 2022-05-03 5 9
# 4 2022-05-09 5 9
# 5 2022-05-27 5 9
# 6 2022-06-03 6 28
# 7 2022-08-17 8 9
# 8 2022-10-31 10 9
# 9 2022-11-18 11 28
# 10 2022-12-31 12 9
This demonstrates that when we're 'inside' the RHS of one of the conditions, it is a single call that covers all rows of the frame. Notice that each time we call runif, it sees all values of month (we have 10 rows in df).
Instead, use n() (the number of rows in the current group, which here is the whole frame):
set.seed(42)
df %>%
mutate(
new_day = case_when(
month == 2 ~ floor(runif(n(), 1, 28)),
month %in% c(9, 4, 6, 11) ~ floor(runif(n(), 1, 30)),
TRUE ~ floor(runif(n(), 1, 31))
)
)
# day month new_day
# 1 2022-02-19 2 25
# 2 2022-03-16 3 5
# 3 2022-05-03 5 30
# 4 2022-05-09 5 29
# 5 2022-05-27 5 3
# 6 2022-06-03 6 28
# 7 2022-08-17 8 12
# 8 2022-10-31 10 28
# 9 2022-11-18 11 14
# 10 2022-12-31 12 26
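One caveat with the floor(runif(n(), 1, 28)) pattern above: runif does not return its upper bound in practice, so the last day of the month is never drawn. A minimal base-R sketch of an alternative that draws an integer directly for each row is shown here. The lookup vector days_in_month and the example month vector are assumptions made for illustration (non-leap year), not part of the original answer.

```r
# Hypothetical lookup table: days per month in a non-leap year.
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

set.seed(42)
# Same month values as the sample data above.
month <- c(2, 3, 5, 5, 5, 6, 8, 10, 11, 12)
upper <- days_in_month[month]

# One independent draw per row: sample.int(n, size = 1) returns
# a single integer between 1 and n, inclusive on both ends.
new_day <- vapply(upper, sample.int, integer(1), size = 1)
new_day
```

Inside a dplyr pipeline the same idea would be mutate(new_day = vapply(days_in_month[month], sample.int, integer(1), size = 1)), assuming the lookup vector is defined beforehand.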
CodePudding user response:
You could create a small helper function that returns the number of days in each month.
# requires dplyr for case_when(); ignores leap years
month_days <- function(x) case_when(
  x == 2 ~ 28,
  x %in% c(1,3,5,7,8,10,12) ~ 31,
  TRUE ~ 30
)
Then you can use the fact that max= is vectorized in runif to get all the values at once. Note that since you are using floor(), you'll want to add 1 to the max so that the last day of the month has a chance of being drawn.
set.seed(22)
# test data
N <- 50
dd <- data.frame(new_month = sample(1:12, N, replace=TRUE))
dd$new_day <- floor( runif( length(dd$new_month), min=1, max=month_days(dd$new_month) + 1 ) )
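To see why the extra 1 on the upper bound matters, here is a small sketch comparing the two choices for a 28-day month. The variable names and the number of draws are arbitrary choices for illustration.

```r
set.seed(1)
n <- 1e5

# Upper bound 28: floor() can only ever produce 1..27.
without_plus_one <- floor(runif(n, min = 1, max = 28))

# Upper bound 28 + 1: floor() produces 1..28.
with_plus_one <- floor(runif(n, min = 1, max = 28 + 1))

range(without_plus_one)
range(with_plus_one)
```

Because each integer k is hit whenever the draw lands in [k, k + 1), every day gets an interval of the same width only once the upper bound is the number of days plus one.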
CodePudding user response:
Another option is sampling from a seq.Date call that exploits the values stored in POSIXlt. We can set the day to 1 for the start of the month, then increment the month and subtract one day for the end of the month. This automatically takes leap years into account.
f <- \(x) {
  sample(with(as.POSIXlt(x),
              seq.Date(as.Date(ISOdate(year + 1900, mon + 1, 1, 0)),
                       as.Date(ISOdate(year + 1900, mon + 2, 1, 0)) - 1, 'day')),
         1)
}
res <- transform(df, new_date=do.call(c, lapply(df$date, f)))
res
# x date new_date
# 1 0.9148060 2021-06-17 2021-06-22
# 2 0.9370754 2022-08-13 2022-08-18
# 3 0.2861395 2020-08-23 2020-08-13
# 4 0.8304476 2022-07-30 2022-07-28
# 5 0.6417455 2021-07-20 2021-07-05
# 6 0.5190959 2021-09-23 2021-09-04
# 7 0.7365883 2020-09-12 2020-09-02
# 8 0.1346666 2022-05-20 2022-05-24
# 9 0.6569923 2021-05-09 2021-05-18
# 10 0.7050648 2019-09-16 2019-09-03
# 11 0.4577418 2022-08-30 2022-08-24
# 12 0.7191123 2020-04-25 2020-04-23
# 13 0.9346722 2022-08-14 2022-08-17
# 14 0.2554288 2019-01-24 2019-01-21
# 15 0.4622928 2022-03-27 2022-03-26
# 16 0.9400145 2019-10-26 2019-10-18
# 17 0.9782264 2020-02-10 2020-02-06
# 18 0.1174874 2019-11-10 2019-11-06
# 19 0.4749971 2022-08-08 2022-08-02
# 20 0.5603327 2021-04-15 2021-04-20
I'm not sure whether you want dates or numbers. If you want the new months and days as numbers, you can do:
within(res, {
new_date <- do.call(c, lapply(df$date, f))
month <- strftime(new_date, '%m')
day <- strftime(new_date, '%d')
}) |>
type.convert(as.is=TRUE)
# x date new_date day month
# 1 0.9148060 2021-06-17 2021-06-03 3 6
# 2 0.9370754 2022-08-13 2022-08-22 22 8
# 3 0.2861395 2020-08-23 2020-08-21 21 8
# 4 0.8304476 2022-07-30 2022-07-02 2 7
# 5 0.6417455 2021-07-20 2021-07-23 23 7
# 6 0.5190959 2021-09-23 2021-09-06 6 9
# 7 0.7365883 2020-09-12 2020-09-26 26 9
# 8 0.1346666 2022-05-20 2022-05-10 10 5
# 9 0.6569923 2021-05-09 2021-05-08 8 5
# 10 0.7050648 2019-09-16 2019-09-05 5 9
# 11 0.4577418 2022-08-30 2022-08-01 1 8
# 12 0.7191123 2020-04-25 2020-04-17 17 4
# 13 0.9346722 2022-08-14 2022-08-07 7 8
# 14 0.2554288 2019-01-24 2019-01-04 4 1
# 15 0.4622928 2022-03-27 2022-03-13 13 3
# 16 0.9400145 2019-10-26 2019-10-10 10 10
# 17 0.9782264 2020-02-10 2020-02-09 9 2
# 18 0.1174874 2019-11-10 2019-11-29 29 11
# 19 0.4749971 2022-08-08 2022-08-12 12 8
# 20 0.5603327 2021-04-15 2021-04-20 20 4
Data:
df <- structure(list(x = c(0.914806043496355, 0.937075413297862, 0.286139534786344,
0.830447626067325, 0.641745518893003, 0.519095949130133, 0.736588314641267,
0.13466659723781, 0.656992290401831, 0.705064784036949, 0.45774177624844,
0.719112251652405, 0.934672247152776, 0.255428824340925, 0.462292822543532,
0.940014522755519, 0.978226428385824, 0.117487361654639, 0.474997081561014,
0.560332746244967), date = structure(c(18795, 19217, 18497, 19203,
18828, 18893, 18517, 19132, 18756, 18155, 19234, 18377, 19218,
17920, 19078, 18195, 18302, 18210, 19212, 18732), class = "Date")), class = "data.frame", row.names = c(NA,
-20L))
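One thing to watch with the helper f in this answer: for a December date, mon + 2 evaluates to 13, and as far as I can tell ISOdate returns NA for a month of 13, so the seq.Date call would fail. The sample data contains no December dates, so this does not show up in the output above. A possible variant that steps to the next month with seq() instead is sketched below; the name f_safe and the test dates are made up for illustration.

```r
# Hypothetical December-safe variant: pick a random date in the month of x.
f_safe <- function(x) {
  # First day of x's month.
  first <- as.Date(format(x, "%Y-%m-01"))
  # First day of the following month; seq() rolls over the year boundary.
  next_first <- seq(first, by = "month", length.out = 2)[2]
  # Number of days in the month, then a uniform offset of 0..(n_days - 1).
  n_days <- as.integer(next_first - first)
  first + sample.int(n_days, 1) - 1L
}

dates <- as.Date(c("2021-12-15", "2020-02-10", "2022-08-30"))
set.seed(1)
new_dates <- do.call(c, lapply(dates, f_safe))
new_dates
```

It can be dropped in for f in the transform() call, e.g. transform(df, new_date = do.call(c, lapply(df$date, f_safe))).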