If i have this dataframe:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
'30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011',
'31-12-2011', '31-12-2012'),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA)
)
How can i go about creating a column which identifies what the most recent non NA rows are for that group of period and website?
So the final output would look like this:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010','30-11-2010', '31-12-2010',
'30-04-2011','31-07-2011','30-09-2011','30-11-2011', '31-12-2011',
'31-12-2011', '31-12-2012'),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, NA, 5, NA, NA, NA, NA, 10, 20, NA),
most_recent = c('no','yes', 'no', 'no', 'no', 'no', 'no','yes','yes','yes','yes','yes','yes','no')
)
I am trying to identify that when the first non-na value occurs for the group of period and website when sorted by the most recent date, then mark all the values for this period and website as "yes" in the most_recent column
So what you have is the following:
- google 2011END is the latest value by date so is a yes
- facebook 2011q1 to 2011END will have a yes value since there is one non-na value being 2011END and this is the most recent date and has a non-na value
- youtube 2011END is yes - since it is the first non-na value we occur when we sort by date as 2012 has no value it is a no value
I know it involves a group by but unsure where to go from there
This has been updated for clarity
CodePudding user response:
The code below selects the most recent non NA
row by website.
As this isn't fully your expected result, as suggested in comments please clarify your question if necessary.
data[,most_recent:=fifelse(!is.na(values)&date==.SD[!is.na(values),max(date)],'yes','no'),by=website][]
period date website values most_recent
1: 2010END 31-12-2010 google 1 no
2: 2011END 31-12-2011 google 2 yes
3: 2010Q1 30-04-2010 facebook 1 no
4: 2010Q2 31-07-2010 facebook 2 no
5: 2010Q3 30-09-2010 facebook 3 no
6: 2010Q4 30-11-2010 facebook NA no
7: 2010END 31-12-2010 facebook 5 no
8: 2011Q1 30-04-2011 facebook NA no
9: 2011Q2 31-07-2011 facebook NA no
10: 2011Q3 30-09-2011 facebook NA no
11: 2011Q4 30-11-2011 facebook NA no
12: 2011END 31-12-2011 facebook 10 yes
13: 2011END 31-12-2011 youtube 20 yes
14: 2012END 31-12-2012 youtube NA no