How to divide a large dataset into smaller datasets by birth year using very few commands?

Time:11-19

Suppose I have a dataset with people born in different years:

      ID year birth_year outcome
1  10021 2015       1960       1
2  10021 2016       1960       1
3  10021 2017       1960       1
4  10021 2018       1960       0
5  10021 2019       1960       0
6  10022 2015       1968       1
7  10022 2016       1968       0
8  10022 2017       1968       0
9  10022 2018       1968       0
10 10022 2019       1968       0
11 10023 2015       1968       1
12 10023 2016       1968       1
13 10023 2017       1968       1
14 10023 2018       1968       1
15 10023 2019       1968       1
16 10024 2015       1961       0
17 10024 2016       1961       0
18 10024 2017       1961       0
19 10024 2018       1961       1
20 10024 2019       1961       1

I want to split this dataset into smaller datasets according to birth year, and store them as year1960, year1961 and year1968. Specifically,

> year1960

      ID year birth_year outcome
1  10021 2015       1960       1
2  10021 2016       1960       1
3  10021 2017       1960       1
4  10021 2018       1960       0
5  10021 2019       1960       0

> year1961

     ID year birth_year outcome
1 10024 2015       1961       0
2 10024 2016       1961       0
3 10024 2017       1961       0
4 10024 2018       1961       1
5 10024 2019       1961       1

> year1968

      ID year birth_year outcome
1  10022 2015       1968       1
2  10022 2016       1968       0
3  10022 2017       1968       0
4  10022 2018       1968       0
5  10022 2019       1968       0
6  10023 2015       1968       1
7  10023 2016       1968       1
8  10023 2017       1968       1
9  10023 2018       1968       1
10 10023 2019       1968       1

How do I do this in the fewest steps possible?

CodePudding user response:

There are probably shorter or better ways to do this, but this will work, and you'll end up with an individual data frame for each birth year.

# read data
df <- read.csv('data.csv')

# split data by 'birth_year' into a list of data frames
df_split <- split(df, df$birth_year)

# rename list elements to year1960, year1961, ...
names(df_split) <- paste0('year', names(df_split))

# create individual data frames in the global environment from the list
list2env(df_split, envir = .GlobalEnv)
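
If you don't strictly need separate objects in the global environment, it may be simpler to keep the pieces in the named list and work with them there. The following is a minimal sketch of that variant, not the only way to do it. It rebuilds the question's example data inline so it runs without `data.csv`; the name `by_year` is hypothetical, chosen only for this illustration.

```r
# Rebuild the example data from the question inline (no data.csv needed)
df <- data.frame(
  ID         = rep(c(10021, 10022, 10023, 10024), each = 5),
  year       = rep(2015:2019, times = 4),
  birth_year = rep(c(1960, 1968, 1968, 1961), each = 5),
  outcome    = c(1, 1, 1, 0, 0,
                 1, 0, 0, 0, 0,
                 1, 1, 1, 1, 1,
                 0, 0, 0, 1, 1)
)

# Split into a list with one data frame per birth year
# ('by_year' is a hypothetical name used only in this sketch)
by_year <- split(df, df$birth_year)
names(by_year) <- paste0("year", names(by_year))

# Reset row names so each piece is numbered from 1, as in the question
by_year <- lapply(by_year, function(d) {
  rownames(d) <- NULL
  d
})

# Access a single piece by name
by_year$year1961

# Apply the same operation to every piece, e.g. the mean outcome per birth year
sapply(by_year, function(d) mean(d$outcome))
```

Keeping the list avoids creating many similarly named objects in the workspace and lets you loop over the pieces with `lapply()` or `sapply()`. If separate objects are still wanted, `list2env(by_year, envir = .GlobalEnv)` can be called on the result.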
Tags: r