I want to keep distinct rows on a data frame, with an algorithm that chooses the last value per group (as dplyr::distinct()
does by default), but only if it's not NA
. I've seen this great answer on SO that relies on data.table
, but I can't scale it to data with more than one grouping variable.
To demonstrate the problem, I start with the minimal example that does work, then scale it up. So first, consider the following data:
library(tibble)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
I want to get the distinct type
values per id
, by choosing the last value unless it's NA
. If the last is NA
then go up until there's non-NA
. So this answer shows us how to do it with data.table
:
library(data.table)
dt_id_and_type <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#> id type
#> 1: 1 A
#> 2: 2 B
#> 3: 3 D
#> 4: 4 C
#> 5: 5 A
#> 6: 6 B
But what to do if we have more than one grouping variable (i.e, not only id
)? In the following example I add a year
variable:
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
df_id_year_and_type
#> # A tibble: 13 x 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 1 2002 <NA>
#> 3 2 2008 B
#> 4 3 2010 A
#> 5 3 2010 <NA>
#> 6 3 2010 D
#> 7 3 2013 <NA>
#> 8 4 2020 <NA>
#> 9 4 2020 C
#> 10 5 2009 A
#> 11 6 2010 <NA>
#> 12 6 2010 B
#> 13 6 2012 <NA>
My expected output would be:
## # A tibble: 8 x 3
## id year type
## <dbl> <dbl> <chr>
## 1 1 2002 A
## 2 2 2008 B
## 3 3 2010 D
## 4 3 2013 NA # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5 4 2020 C
## 6 5 2009 A
## 7 6 2010 B
## 8 6 2012 NA # same as comment above
Any idea how I could scale the solution that worked in 1-grouping-var case to the current data? The first 2 lines of code are no-brainer:
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...
CodePudding user response:
Here some data.table-based solutions.
setDT(df_id_year_and_type)
method 1
na.omit(df_id_year_and_type, cols="type")
drops NA
rows based on column type
.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE)
finds all the groups.
And by joining them (using the last match: mult="last"
), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
method 3
(likely slower because of [
overhead)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
CodePudding user response:
I would propose this solution in which you exclude the unwanted rows prior to unique
. If all observations for a group are NA
, sum(is.na(x)) / .N
is equal to 1 and we proceed from there
library(tibble)
library(data.table)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
# convert to data.table
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
# define grouping vars
grouping_vars <- c("id", "year")
# are all types na for a group?
dt_id_year_and_type[, na_ratio := sum(is.na(type)) / .N,
by = c(grouping_vars)]
# remove all lines that are NA, except they are from a group in which all
# observations are NA
dt_id_year_and_type <- dt_id_year_and_type[na_ratio == 1 | !is.na(type)]
# sort correctly
setorderv(dt_id_year_and_type, grouping_vars)
dt_id_year_and_type
#> id year type na_ratio
#> 1: 1 2002 A 0.5000000
#> 2: 2 2008 B 0.0000000
#> 3: 3 2010 A 0.3333333
#> 4: 3 2010 D 0.3333333
#> 5: 3 2013 <NA> 1.0000000
#> 6: 4 2020 C 0.5000000
#> 7: 5 2009 A 0.0000000
#> 8: 6 2010 B 0.5000000
#> 9: 6 2012 <NA> 1.0000000
# keep only the last observation of each group
dt_unique <- unique(dt_id_year_and_type, by = grouping_vars, fromLast = TRUE)
remove no longer needed helper column
dt_unique[, na_ratio := NULL]
dt_unique
#> id year type
#> 1: 1 2002 A
#> 2: 2 2008 B
#> 3: 3 2010 D
#> 4: 3 2013 <NA>
#> 5: 4 2020 C
#> 6: 5 2009 A
#> 7: 6 2010 B
#> 8: 6 2012 <NA>
CodePudding user response:
Another possible solution:
library(tidyverse)
df_id_year_and_type %>%
group_by(id, year) %>%
fill(type, .direction = "downup") %>%
summarise(type = last(type), .groups = "drop")
#> # A tibble: 8 × 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 2 2008 B
#> 3 3 2010 D
#> 4 3 2013 <NA>
#> 5 4 2020 C
#> 6 5 2009 A
#> 7 6 2010 B
#> 8 6 2012 <NA>
CodePudding user response:
library(dplyr)
A simple, easy to read example of the basic case is
df_id_and_type %>% filter(!is.na(type)) %>%
filter(id != lead(id) | id == max(id))
extending to the second criteria
df_id_year_and_type %>% filter(!is.na(type)) %>%
filter((id != lead(id) | id == max(id)) &
(year != lead(year) | year == max(year)))
It is clear and easy to understand. If you wish to retain the distinct groupings with no result you can either merge distinct back or insert another OR clause in the filters