Finding Combinations of two characters by id in a dataset in R

Time:08-06

I have a dataset with IDs and several fruits per ID. For each ID, I want to find all possible combinations of 2 fruits, without repetition (the combination Apple-Banana should count as the same as Banana-Apple).

As an example:

ID Fruit
1 Apple
1 Banana
1 Blueberry
2 Apple
3 Orange
3 Banana
3 Apple
3 Blueberry

What I want to create is:

ID Combination
1 Apple Banana
1 Apple Blueberry
1 Banana Blueberry
2 Apple
3 Banana Orange
3 Apple Orange
3 Blueberry Orange
3 Apple Banana
3 Banana Blueberry
3 Apple Blueberry

The example dataset:

ID <- c(1,1,1,2,3,3,3,3)
Fruit <- c("Apple","Banana","Blueberry","Apple","Orange","Banana","Apple","Blueberry")
dataset <- data.frame(ID, Fruit)

CodePudding user response:

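With dplyr, you could use summarise() together with combn(). (In dplyr 1.1.0 and later, summarise() warns when a group returns more than one row; reframe() is the replacement for that case and returns an ungrouped tibble.)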

library(dplyr)

dataset %>%
  group_by(ID) %>%
  summarise(Fruit = if(n() > 1) combn(Fruit, 2, simplify = FALSE) else list(Fruit))

# # A tibble: 10 × 2
# # Groups:   ID [3]
#       ID Fruit
#    <dbl> <list>   
#  1     1 <chr [2]>
#  2     1 <chr [2]>
#  3     1 <chr [2]>
#  4     2 <chr [1]>
#  5     3 <chr [2]>
#  6     3 <chr [2]>
#  7     3 <chr [2]>
#  8     3 <chr [2]>
#  9     3 <chr [2]>
# 10     3 <chr [2]>

where Fruit is a list-column containing each pair of fruits for each ID.
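
To see why the n() > 1 guard and the list() wrapper are needed, it can help to call combn() on its own. The following is a minimal base R sketch; the vector names `three` and `one` are made up for illustration and are not part of the answer above.

```r
three <- c("Apple", "Banana", "Blueberry")
one   <- "Apple"

# simplify = FALSE returns a list with one character vector per pair
pairs <- combn(three, 2, simplify = FALSE)
length(pairs)   # 3 pairs from 3 fruits
pairs[[1]]      # "Apple" "Banana"

# With fewer than 2 elements, combn() stops with an error ("n < m"),
# which is why single-fruit IDs need their own branch
single <- tryCatch(combn(one, 2), error = function(e) conditionMessage(e))
single
```

Because the multi-fruit branch yields a list, the single-fruit branch has to be wrapped in list() as well so that both branches produce the same column type.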


To collapse each pair into a single string and get a character column instead of a list-column, pass FUN = toString to combn().

(Note the difference in the else branches: the first version needs else list(Fruit) to match the list-column, while this version uses plain else Fruit.)

dataset %>%
  group_by(ID) %>%
  summarise(Fruit = if(n() > 1) combn(Fruit, 2, FUN = toString) else Fruit)

# # A tibble: 10 × 2
# # Groups:   ID [3]
#       ID Fruit
#    <dbl> <chr>
#  1     1 Apple, Banana
#  2     1 Apple, Blueberry
#  3     1 Banana, Blueberry
#  4     2 Apple
#  5     3 Orange, Banana
#  6     3 Orange, Apple    
#  7     3 Orange, Blueberry
#  8     3 Banana, Apple
#  9     3 Banana, Blueberry
# 10     3 Apple, Blueberry
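
combn() keeps the input order inside each pair, so ID 3 yields "Orange, Banana" rather than the "Banana Orange" shown in the question, and a fruit that appears twice under one ID would be paired with itself. If that matters, one possible approach is to sort and de-duplicate before pairing. The sketch below assumes alphabetical order within a pair is acceptable; `fruit_pairs` is a hypothetical helper name, not something from dplyr or base R.

```r
# Hypothetical helper: all unordered pairs from a character vector
fruit_pairs <- function(x) {
  x <- sort(unique(x))            # drop duplicates, fix the order within pairs
  if (length(x) < 2) return(x)    # combn() would error with fewer than 2 items
  as.vector(combn(x, 2, FUN = toString))  # as.vector() drops the array dim
}

fruit_pairs(c("Orange", "Banana", "Apple", "Blueberry"))
fruit_pairs("Apple")
```

Since the helper always returns a plain character vector, it should be usable in place of the if/else expression in the grouped call above, for example as Fruit = fruit_pairs(Fruit).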

CodePudding user response:

Here is a base R option:

with(
  dataset,
  rev(stack(by(Fruit, ID, function(x) as.vector(combn(x, pmin(2, length(x)), toString)))))
)

which gives

   ind            values
1    1     Apple, Banana
2    1  Apple, Blueberry
3    1 Banana, Blueberry
4    2             Apple
5    3    Orange, Banana
6    3     Orange, Apple
7    3 Orange, Blueberry
8    3     Banana, Apple
9    3 Banana, Blueberry
10   3  Apple, Blueberry
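
The one-liner above packs several steps together. As a rough unpacking, the same result can be built step by step; this sketch swaps by() for split() plus lapply() and uses min() instead of pmin(), which is an assumption about equivalence for this data rather than the answerer's exact code. The intermediate names `by_id`, `pairs` and `res` are illustrative.

```r
ID <- c(1, 1, 1, 2, 3, 3, 3, 3)
Fruit <- c("Apple", "Banana", "Blueberry", "Apple", "Orange", "Banana", "Apple", "Blueberry")
dataset <- data.frame(ID, Fruit)

# Step 1: one character vector of fruits per ID
by_id <- split(as.character(dataset$Fruit), dataset$ID)

# Step 2: pairs per ID; min(2, length(x)) falls back to size-1
# "combinations" when an ID has a single fruit
pairs <- lapply(by_id, function(x) {
  as.vector(combn(x, min(2, length(x)), toString))
})

# Step 3: stack() turns the named list into columns (values, ind);
# rev() swaps the column order so the ID column comes first
res <- rev(stack(pairs))
res
```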

CodePudding user response:

For reference, here is a plain loop-based version.

uniID <- unique(dataset$ID)
res <- NULL
# Loop over the actual ID values, not over their positions in uniID
for (id in uniID)
{
    sameIDdf <- dataset[dataset$ID == id, ]
    x <- nrow(sameIDdf)
    if (x > 1)
    {
       # Each row of comb holds the row indices of one pair
       comb <- t(combn(1:x, 2))
       for (i in 1:nrow(comb))
       {
         res <- rbind(res, data.frame(ID = id, Combination = paste(sameIDdf[comb[i, 1], 'Fruit'], sameIDdf[comb[i, 2], 'Fruit'])))
       }
    } else
    {
        # A single fruit has no pair, so keep it as it is
        res <- rbind(res, data.frame(ID = id, Combination = sameIDdf[1, 'Fruit']))
    }
}
res

Result:

ID  Combination
<int>   <fct>
1   Apple Banana
1   Apple Blueberry
1   Banana Blueberry
2   Apple
3   Orange Banana
3   Orange Apple
3   Orange Blueberry
3   Banana Apple
3   Banana Blueberry
3   Apple Blueberry 
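
Growing a data frame with rbind() inside nested loops gets slow on larger data. A possible variation, sketched here under the assumption that the same space-separated output is wanted, builds one small data frame per ID and binds them once at the end. The names `pieces`, `cmb` and `combos` are illustrative.

```r
ID <- c(1, 1, 1, 2, 3, 3, 3, 3)
Fruit <- c("Apple", "Banana", "Blueberry", "Apple", "Orange", "Banana", "Apple", "Blueberry")
dataset <- data.frame(ID, Fruit)

pieces <- lapply(unique(dataset$ID), function(id) {
  fruits <- as.character(dataset$Fruit[dataset$ID == id])
  if (length(fruits) > 1) {
    cmb <- combn(fruits, 2)              # 2-row matrix, one pair per column
    combos <- paste(cmb[1, ], cmb[2, ])  # vectorised paste replaces the inner loop
  } else {
    combos <- fruits                     # single fruit: nothing to pair
  }
  data.frame(ID = id, Combination = combos)
})

# Bind all per-ID pieces in a single call
res <- do.call(rbind, pieces)
res
```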