Home > Net >  identifying non-duplicate rows in group except for id
identifying non-duplicate rows in group except for id

Time:12-12

I have a dataframe with a bunch of information (mostly character). I have 2 ID columns: one is linked to another dataframe and has unique values and one identifies rows which are identical or almost identical (a grouping ID). I want to create a dataframe in which all rows are the same within a group, except for the first ID column. Basically I have df1, and want to have df2. df1 has few missings and a few mismatched characters, which I want to clean up. Is there a way of identifying rows within a group which are not identical and replacing them (except for the unique id)?

df <- data.frame(id = c(1:10),
                 title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))


df2 <- data.frame(id = c(1:10),
                 title = c("abc", "abc", "abc", "def", "def", "def", "ghi", "jkl", "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))

CodePudding user response:

You could simply replace the title column in each group by its first non-NA member:

library(tidyverse)

df %>%
  group_by(id2) %>%
  summarize(id = id, title = first(na.omit(title))) %>%
  select(c(2, 3, 1))
#> # A tibble: 10 x 3
#> # Groups:   id2 [5]
#>       id title   id2
#>    <int> <chr> <dbl>
#>  1     1 abc       1
#>  2     2 abc       1
#>  3     3 abc       1
#>  4     4 def       2
#>  5     5 def       2
#>  6     6 def       2
#>  7     7 ghi       3
#>  8     8 jkl       4
#>  9     9 jkl       4
#> 10    10 mno       5

CodePudding user response:

Another potential solution is to 'count' the number of each title and 'select' the most frequent title per group, e.g. from c("acc", "abc", "abc", "abb"), select "abc" (count = 2) and change "acc" and "abb" (count = 1 for both) to "abc".

With your example data:

library(dplyr)

df <- data.frame(id = c(1:10),
                 title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))

df %>%
  group_by(id2, title) %>%
  mutate(counter = n()) %>%
  group_by(id2) %>%
  mutate(title = ifelse(counter == max(counter), title, NA)) %>%
  mutate(title = vec_fill_missing(title, direction = "downup")) %>%
  select(-counter)
#> # A tibble: 10 × 3
#> # Groups:   id2 [5]
#>       id title   id2
#>    <int> <chr> <dbl>
#>  1     1 abc       1
#>  2     2 abc       1
#>  3     3 abc       1
#>  4     4 def       2
#>  5     5 def       2
#>  6     6 def       2
#>  7     7 ghi       3
#>  8     8 jkl       4
#>  9     9 jkl       4
#> 10    10 mno       5

Created on 2022-12-12 with reprex v2.0.2

  • Related