Group the column values uniquely in R-CodePudding

I have two columns title and text. I want to group the title by the number of texts it has received. Also, I want to uniquely group the title having the same name.

For e.g.

I have

title | text 
-------------
A     | I like...
B     | I wish...
C     | review1
C     | review2
C     | review3
D     | Detecting...
C     | review1
C     | review2
E     | New...

What I would like to have is:

title | text 
-------------
A     | I like...
B     | I wish...
C     | review1 review2 review3
D     | Detecting...
C     | review1 review2
E     | New...

What I have tried is:

df %>%
    filter(title %in% sample(unique(title))) %>%
    group_by(title) %>%
    select(title, text)

But it still does not give what I expect to have.

CodePudding user response：

Here's a dplyr approach. The key is to set group_by correctly so that it defines group based on row position AND values in the title column.

library(dplyr)

df %>% 
  group_by(gp = c(0, na.omit(cumsum(lead(title) != title)))) %>% 
  summarize(title = unique(title), text = paste0(text, collapse = " ")) %>% 
  select(-gp)

# A tibble: 6 × 2
  title text                   
  <chr> <chr>                  
1 A     I like...              
2 B     I wish...              
3 C     review1 review2 review3
4 D     Detecting...           
5 C     review1 review2        
6 E     New...

CodePudding user response：

Another base R way

tmp=rle(df$title)
df$grp=rep(1:length(tmp$lengths),tmp$lengths)
aggregate(text~title grp,data=df,FUN=paste0,collapse=" ")

  title grp                    text
1     A   1               I like...
2     B   2               I wish...
3     C   3 review1 review2 review3
4     D   4            Detecting...
5     C   5         review1 review2
6     E   6                  New...

CodePudding user response：

I guess you can try the following base R option using aggregate

aggregate(text ~. unique(df), toString)

CodePudding user response：

I'm not using dplyr, but base R could handle it:

do.line = function(a.title){
  return(c(a.title, paste(text[df$title == a.title], collapse = ' ' )))
}
t(sapply(unique(df$title), do.line))

CodePudding user response：

Pivoting comes to mind:

library(tidyverse)

# Build test data
df <- data.frame(title=c("A","B","C","C","C","D","C","C","E"),
                 text=c("I like...","I wish...","review1","review2","review3","Detecting","review1","review2","New..."))

# Combine all values in a list by pivoting
new_df <- df %>% pivot_wider(names_from=title, values_from=text, values_fn=list)

# Bring to desired format by pivoting back
new_df <- new_df %>% pivot_longer(cols=c(names(new_df)), names_to="title", values_to="text")

# Inspecting result
new_df

str(new_df)

# Example query
new_df %>%  filter(title=="C") %>% unlist()

CodePudding user response：

You can write a function that concatenates the unique values of a vector and use that after the group_by:

library(dplyr)

df <- data.frame(title = c('A','B','C','C','C','D','C','C','E'),
                 text = c('I like...', 'I wish...', 'review1','review2','review3',
                          'Detecting...','review1','review2', 'New...'))

unique_paste <- function(text_vec) {
  paste(unique(text_vec), collapse = " ")
}

df2 <- df %>% 
  mutate(id = cumsum(title != lag(title, default = 'A'))) %>% 
  group_by(id, title) %>% 
  do(text = unique_paste(.$text)) %>% 
  ungroup()