Home > Back-end >  How to find the number of variables in common between variables two by two in a column R compared to
How to find the number of variables in common between variables two by two in a column R compared to

Time:04-14

I'm working on R, I have a character data table like that :

Species<- c("lion", "tiger", "lion","tiger","donkey","lion","donkey")
Countries <- c("Tanzania", "Tanzania", "Kenya","Kenya","Italia","Niger","France")

df <- data.frame(Species, Countries)
Species Countries
lion Tanzania
tiger Tanzania
lion Kenya
tiger Kenya
donkey Italia
lion Niger
donkey France

With this table, I would like to have for each country of my dataframe the number of species in common for the countries 2 by 2, and have a final table like this one:

Countries1 Countries2 number_common_species
Kenya Tanzania 2
Italia Tanzania 0
Italia Kenya 0
Niger Kenya 1
Niger Tanzania 1
Italia Niger 0
Italia France 1
Tanzania France 0
Kenya France 0
Niger France 0

I have a really big dataset with a lot of species.

Does anyone know how I can do this?

CodePudding user response:

First make a data frame of each pairwise combination of countries, then find the length of the set of species that each pair has in common:

counts <- data.frame(t(combn(unique(dat$Countries),2)))
counts$count <- apply(counts, 1, function(x) length(intersect( dat$Species[dat$Countries==x[1]], 
                                                               dat$Species[dat$Countries==x[2]])))
counts

         X1     X2 count
1  Tanzania  Kenya     2
2  Tanzania Italia     0
3  Tanzania  Niger     1
4  Tanzania France     0
5     Kenya Italia     0
6     Kenya  Niger     1
7     Kenya France     0
8    Italia  Niger     0
9    Italia France     1
10    Niger France     0

CodePudding user response:

This is harder than it looks, and hopefully someone has a more elegant solution:

df2 <- expand.grid(country2 = sort(unique(df$Countries)),
                   country1 = sort(unique(df$Countries)))[2:1]
cp <- crossprod(table(df$Species, df$Countries))
df2$common <- c(cp)
df2 <- df2[c(lower.tri(cp)),]

df2
#>    country1 country2 common
#> 2    France   Italia      1
#> 3    France    Kenya      0
#> 4    France    Niger      0
#> 5    France Tanzania      0
#> 8    Italia    Kenya      0
#> 9    Italia    Niger      0
#> 10   Italia Tanzania      0
#> 14    Kenya    Niger      1
#> 15    Kenya Tanzania      2
#> 20    Niger Tanzania      1

Created on 2022-04-13 by the reprex package (v2.0.1)

CodePudding user response:

Another base R option using crossprod as.table as.data.frame

m <- crossprod(table(df))
m[upper.tri(m, diag = TRUE)] <- NA
na.omit(as.data.frame(as.table(m)))

gives

   Countries Countries.1 Freq
2     Italia      France    1
3      Kenya      France    0
4      Niger      France    0
5   Tanzania      France    0
8      Kenya      Italia    0
9      Niger      Italia    0
10  Tanzania      Italia    0
14     Niger       Kenya    1
15  Tanzania       Kenya    2
20  Tanzania       Niger    1
  • Related