identifying non-duplicate rows in group except for id-CodePudding

I have a dataframe with a bunch of information (mostly character). I have 2 ID columns: one is linked to another dataframe and has unique values and one identifies rows which are identical or almost identical (a grouping ID). I want to create a dataframe in which all rows are the same within a group, except for the first ID column. Basically I have df1, and want to have df2. df1 has few missings and a few mismatched characters, which I want to clean up. Is there a way of identifying rows within a group which are not identical and replacing them (except for the unique id)?

df <- data.frame(id = c(1:10),
                 title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))


df2 <- data.frame(id = c(1:10),
                 title = c("abc", "abc", "abc", "def", "def", "def", "ghi", "jkl", "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))

CodePudding user response：

You could simply replace the title column in each group by its first non-NA member:

library(tidyverse)

df %>%
  group_by(id2) %>%
  summarize(id = id, title = first(na.omit(title))) %>%
  select(c(2, 3, 1))
#> # A tibble: 10 x 3
#> # Groups:   id2 [5]
#>       id title   id2
#>    <int> <chr> <dbl>
#>  1     1 abc       1
#>  2     2 abc       1
#>  3     3 abc       1
#>  4     4 def       2
#>  5     5 def       2
#>  6     6 def       2
#>  7     7 ghi       3
#>  8     8 jkl       4
#>  9     9 jkl       4
#> 10    10 mno       5

CodePudding user response：

Another potential solution is to 'count' the number of each title and 'select' the most frequent title per group, e.g. from c("acc", "abc", "abc", "abb"), select "abc" (count = 2) and change "acc" and "abb" (count = 1 for both) to "abc".

With your example data:

library(dplyr)

df <- data.frame(id = c(1:10),
                 title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"), 
                 id2 = c(1,1,1,2,2,2,3,4,4,5))

df %>%
  group_by(id2, title) %>%
  mutate(counter = n()) %>%
  group_by(id2) %>%
  mutate(title = ifelse(counter == max(counter), title, NA)) %>%
  mutate(title = vec_fill_missing(title, direction = "downup")) %>%
  select(-counter)
#> # A tibble: 10 × 3
#> # Groups:   id2 [5]
#>       id title   id2
#>    <int> <chr> <dbl>
#>  1     1 abc       1
#>  2     2 abc       1
#>  3     3 abc       1
#>  4     4 def       2
#>  5     5 def       2
#>  6     6 def       2
#>  7     7 ghi       3
#>  8     8 jkl       4
#>  9     9 jkl       4
#> 10    10 mno       5

^{Created on 2022-12-12 with reprex v2.0.2}