I have a dataframe with a bunch of information (mostly character). I have 2 ID columns: one is linked to another dataframe and has unique values and one identifies rows which are identical or almost identical (a grouping ID). I want to create a dataframe in which all rows are the same within a group, except for the first ID column. Basically I have df1, and want to have df2. df1 has few missings and a few mismatched characters, which I want to clean up. Is there a way of identifying rows within a group which are not identical and replacing them (except for the unique id)?
df <- data.frame(id = c(1:10),
title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"),
id2 = c(1,1,1,2,2,2,3,4,4,5))
df2 <- data.frame(id = c(1:10),
title = c("abc", "abc", "abc", "def", "def", "def", "ghi", "jkl", "jkl", "mno"),
id2 = c(1,1,1,2,2,2,3,4,4,5))
CodePudding user response:
You could simply replace the title
column in each group by its first non-NA member:
library(tidyverse)
df %>%
group_by(id2) %>%
summarize(id = id, title = first(na.omit(title))) %>%
select(c(2, 3, 1))
#> # A tibble: 10 x 3
#> # Groups: id2 [5]
#> id title id2
#> <int> <chr> <dbl>
#> 1 1 abc 1
#> 2 2 abc 1
#> 3 3 abc 1
#> 4 4 def 2
#> 5 5 def 2
#> 6 6 def 2
#> 7 7 ghi 3
#> 8 8 jkl 4
#> 9 9 jkl 4
#> 10 10 mno 5
CodePudding user response:
Another potential solution is to 'count' the number of each title and 'select' the most frequent title per group, e.g. from c("acc", "abc", "abc", "abb")
, select "abc" (count = 2) and change "acc" and "abb" (count = 1 for both) to "abc".
With your example data:
library(dplyr)
df <- data.frame(id = c(1:10),
title = c("abc", "aac", "abc", "def", "def", "dde", "ghi", NA, "jkl", "mno"),
id2 = c(1,1,1,2,2,2,3,4,4,5))
df %>%
group_by(id2, title) %>%
mutate(counter = n()) %>%
group_by(id2) %>%
mutate(title = ifelse(counter == max(counter), title, NA)) %>%
mutate(title = vec_fill_missing(title, direction = "downup")) %>%
select(-counter)
#> # A tibble: 10 × 3
#> # Groups: id2 [5]
#> id title id2
#> <int> <chr> <dbl>
#> 1 1 abc 1
#> 2 2 abc 1
#> 3 3 abc 1
#> 4 4 def 2
#> 5 5 def 2
#> 6 6 def 2
#> 7 7 ghi 3
#> 8 8 jkl 4
#> 9 9 jkl 4
#> 10 10 mno 5
Created on 2022-12-12 with reprex v2.0.2