I have a big data set containing information about the number of years children lived at home, education status of their mother and survey year. I'm trying to create a table that shows the percentage of people living with their parents for 15 years during each survey year, controlled for their mothers education (using casmin-index).
The table should look something like this:
% 15 years in survey year 2000 | % 15 years in survey year 2005 | % 15 years in survey year 2010 | |
---|---|---|---|
casmin_mother1 | ... | ||
casmin_mother2 | |||
casmin_mother3 |
Example data would be:
df <- data.frame(years = c(15,14,15,15,12,15,11,4,1,15),
casmin_mother= c(3,3,2,1,2,1,3,2,1,2),
sample_year = c(2000, 2005, 2000,2005, 2000, 2005, 2000, 2005, 2000, 2005))
I asked this question here before in different form and got the following answer which sadly didn't work (which I only found out yesterday due to holidays, so posting on the old question would be awkward I think?):
df %>%
group_by(sample_year) %>%
mutate(fifteen= (sample_year == 15)/n()) %>%
group_by(sample_year, casmin_mother) %>%
summarise(fifteen_prop = sum(fifteen))
This would create the table in a different form which is no problem, but the percentages are off. I think it might have to do with n()
using the complete column and not the column sorted by sample year, but I really couldn't figure it out.
I also tried various aggregate
, filter
and if_any
ideas, but they all failed in (at least) one regard.
Also I have a weight variable for each row, so if that could be included that would be a nice bonus, but I think I could figure that out once I have the "base table".
I'm sorry for the (sort of) repost and hope anyone can help me with this.
Thanks a lot!
CodePudding user response:
Is this what you mean?
> aggregate(df, years ~ sample_year casmin_mother, FUN = function(x) 100 * sum (x == 15) / length(x))
sample_year casmin_mother years
1 2000 1 0
2 2005 1 100
3 2000 2 50
4 2005 2 50
5 2000 3 50
6 2005 3 0
To get to the structure you showed, you can use the reshape2 package.
> ds <- aggregate(df, years ~ sample_year casmin_mother, FUN = function(x) 100 * sum(x == 15) / length(x))
> dcast(ds, casmin_mother ~ sample_year, value.var = 'years')
casmin_mother 2000 2005
1 1 0 100
2 2 50 50
3 3 50 0