I need to split a large dataframe of meteorological time series into training and validation samples. It contains data from multiple stations, which have varying periods of observation. How can I divide it so that the proportion of training and validation observations is the same for every station? Given the following dataset:
Station | Date | temp |
---|---|---|
A | 2012-01-01 | -0.8 |
A | 2012-01-02 | 0.1 |
A | 2012-01-03 | 0.5 |
A | 2012-01-04 | 0.4 |
B | 2012-01-01 | 0.1 |
B | 2012-01-02 | 0.5 |
and assuming that the training set should include only the first 50% of the observations from each station, the desired output would be:
Station | Date | temp |
---|---|---|
A | 2012-01-01 | -0.8 |
A | 2012-01-02 | 0.1 |
B | 2012-01-01 | 0.1 |
CodePudding user response:
Given your example, you could use slice_head from dplyr, which keeps the first rows of each group. To create the validation set, remove the records that are already in the training set. This ensures every record ends up in exactly one of the two sets, even when a station has an odd number of records.
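library(dplyr)

# Training: first 50% of the rows of each station (in the order they appear in df1)
training <- df1 %>%
  mutate(Date = as.Date(Date),
         id = row_number()) %>%   # id identifies each original row
  group_by(Station) %>%
  slice_head(prop = 0.5)

# Validation: every row whose id is not in the training set
validation <- df1 %>%
  mutate(Date = as.Date(Date),
         id = row_number()) %>%
  filter(!id %in% training$id)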
training
# A tibble: 3 x 4
# Groups:   Station [2]
  Station Date        temp    id
  <chr>   <date>     <dbl> <int>
1 A       2012-01-01  -0.8     1
2 A       2012-01-02   0.1     2
3 B       2012-01-01   0.1     5

validation
  Station       Date temp id
1       A 2012-01-03  0.5  3
2       A 2012-01-04  0.4  4
3       B 2012-01-02  0.5  6
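As a possible variation, the same split can be sketched in a single pass by labelling each row instead of building the two sets separately. This is only a sketch under some assumptions: the data frame is rebuilt here from the example in the question, the names `train_prop`, `labelled` and `set` are made up for illustration, and the rows are sorted by date first so that "first 50%" means the earliest observations of each station. The cut-off is rounded down with `floor()`, so a station with an odd number of rows gets the extra row in validation.

```r
library(dplyr)

# Example data from the question (df1 is the name used in the answer above)
df1 <- data.frame(
  Station = c("A", "A", "A", "A", "B", "B"),
  Date    = c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
              "2012-01-01", "2012-01-02"),
  temp    = c(-0.8, 0.1, 0.5, 0.4, 0.1, 0.5)
)

train_prop <- 0.5  # hypothetical name: share of each station used for training

labelled <- df1 %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Station, Date) %>%        # make sure "first" means earliest in time
  group_by(Station) %>%
  mutate(set = if_else(row_number() <= floor(n() * train_prop),
                       "training", "validation")) %>%
  ungroup()

training   <- filter(labelled, set == "training")
validation <- filter(labelled, set == "validation")
```

Keeping the label as a column can be convenient if the proportion per station needs to be checked afterwards, for example with `count(labelled, Station, set)`.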