I have a very large data frame with a column with years.
Say this df looks like this:
df <- data.frame(var1 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"),
ANO = c(1990, 1994, 2008, 2019, 1992, 2003, 2008, 2001, 2015, 1998, 2012, 1985, 1970, 1965))
I want to subset this large df into a list of dfs based on the column 'ANO' with a 'smaller or equal than' expression.
Importantly: The original dataset has values ranging c.1953 -2020, but I only need the new 31 dfs for the years 1990-2020. However, the individual dfs need to have the rows for the years before 1990 as well. This is, the new list would need to have the 31 dfs (df_1990 - df_2020) and, for instance, the df for 1990 would need to include all rows with year before 1990; df_1991, all rows with years before 1991, and so on until 2020.
It works if I specify each dataframe I want to subset individually, like this:
new_list <- list()
new_list$df_2010 <- df[df$ANO <= 2010,]
new_list$df_2011 <- df[df$ANO <= 2011,]
new_list$df_2012 <- df[df$ANO <= 2012,]
etc.
I would like to avoid copying 30 lines every time, so I tried something like this:
years <- seq(from=1990, to=2020)
new_list <- list()
for(i in 1:length(years)) {
df_loop <- df[df$ANO <= years[i],]
new_list <- df_loop[[i]]
}
But it does not work, I get this error:
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Any suggestions on how to achieve that result?
CodePudding user response:
You can use lapply
to iterate over every value in df$ANO
, and use subset
to pick rows that is smaller than or equal to ANO
. Wrap the lapply
function with setNames
to name the sub-list data frames.
Edit: If you want to only include a specific range, you can play around with the X
argument in lapply
. Here we'll only loop through 1990
to 2020
in lapply
.
new_list <- setNames(lapply(sort(df$ANO)[sort(df$ANO) >= 1990], function(x) subset(df, ANO <= x)), paste0("df_", sort(df$ANO)[sort(df$ANO)>=1990]))
Output
new_list
$df_1990
var1 ANO
1 A 1990
$df_1992
var1 ANO
1 A 1990
5 E 1992
$df_1994
var1 ANO
1 A 1990
2 B 1994
5 E 1992
$df_1998
var1 ANO
1 A 1990
2 B 1994
5 E 1992
10 J 1998
$df_2001
var1 ANO
1 A 1990
2 B 1994
5 E 1992
8 H 2001
10 J 1998
$df_2003
var1 ANO
1 A 1990
2 B 1994
5 E 1992
6 F 2003
8 H 2001
10 J 1998
$df_2008
var1 ANO
1 A 1990
2 B 1994
3 C 2008
5 E 1992
6 F 2003
7 G 2008
8 H 2001
10 J 1998
$df_2008
var1 ANO
1 A 1990
2 B 1994
3 C 2008
5 E 1992
6 F 2003
7 G 2008
8 H 2001
10 J 1998
$df_2012
var1 ANO
1 A 1990
2 B 1994
3 C 2008
5 E 1992
6 F 2003
7 G 2008
8 H 2001
10 J 1998
11 K 2012
$df_2015
var1 ANO
1 A 1990
2 B 1994
3 C 2008
5 E 1992
6 F 2003
7 G 2008
8 H 2001
9 I 2015
10 J 1998
11 K 2012
$df_2019
var1 ANO
1 A 1990
2 B 1994
3 C 2008
4 D 2019
5 E 1992
6 F 2003
7 G 2008
8 H 2001
9 I 2015
10 J 1998
11 K 2012