Home > Software engineering >  R: Merge rows that share same code and at least one or more strings in name-column
R: Merge rows that share same code and at least one or more strings in name-column

Time:10-13

I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:

    df <- data.frame(
        id = 1:8,
        name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
        code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)

The approach that I envision would look something like this:

  • use group_by on the code-column,
  • check if the group contains 2 or more rows,
  • check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.

The final dataset would look like this:

final_df <- data.frame(
  id = c("1|2|3", 4:8),
  name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
  code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)

The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.

CodePudding user response:

You can use this helper function f(), and apply it to each group:

f <- function(d) {
  if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
    d = lapply(d,paste0,collapse="|")
  }
  return(d)
}

library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]

Output:

     code     id                      name
   <char> <char>                    <char>
1: 4564AB  1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC      4                     rhino
3: 7845BC      5                     hippo
4: 6144DE      6                  elephant
5: 7845KI      7             elephant bros
6: 7845EG      8                 last comp

CodePudding user response:

We could have a condition with if/else after grouping. Extract the words from the 'name' column and check for any intersecting elements, create a flag where the length of intersecting elements are greater than 0 and the group size (n()) is greater than 1 and use this to paste/str_c elements of the other columns

library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
  group_by(code = factor(code, levels = unique(code))) %>%
  mutate(flag = n() > 1 & 
   (str_extract_all(name, "\\w ") %>%
       reduce(intersect) %>%
       length %>%
       is_greater_than(0))) %>%
   summarise(across(-flag, ~ if(any(flag)) 
     str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
   select(names(df))

-output

# A tibble: 6 × 3
  id    name                      code  
  <chr> <chr>                     <fct> 
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4     rhino                     7845BC
3 5     hippo                     7845BC
4 6     elephant                  6144DE
5 7     elephant bros             7845KI
6 8     last comp                 7845EG

-OP's expected

> final_df
     id                      name   code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2     4                     rhino 7845BC
3     5                     hippo 7845BC
4     6                  elephant 6144DE
5     7             elephant bros 7845KI
6     8                 last comp 7845EG
  • Related