I'm working on R, I have a character data table like that :
Species<- c("lion", "tiger", "lion","tiger","donkey","lion","donkey")
Countries <- c("Tanzania", "Tanzania", "Kenya","Kenya","Italia","Niger","France")
df <- data.frame(Species, Countries)
Species | Countries |
---|---|
lion | Tanzania |
tiger | Tanzania |
lion | Kenya |
tiger | Kenya |
donkey | Italia |
lion | Niger |
donkey | France |
With this table, I would like to have for each country of my dataframe the number of species in common for the countries 2 by 2, and have a final table like this one:
Countries1 | Countries2 | number_common_species |
---|---|---|
Kenya | Tanzania | 2 |
Italia | Tanzania | 0 |
Italia | Kenya | 0 |
Niger | Kenya | 1 |
Niger | Tanzania | 1 |
Italia | Niger | 0 |
Italia | France | 1 |
Tanzania | France | 0 |
Kenya | France | 0 |
Niger | France | 0 |
I have a really big dataset with a lot of species.
Does anyone know how I can do this?
CodePudding user response:
First make a data frame of each pairwise combination of countries, then find the length of the set of species that each pair has in common:
counts <- data.frame(t(combn(unique(dat$Countries),2)))
counts$count <- apply(counts, 1, function(x) length(intersect( dat$Species[dat$Countries==x[1]],
dat$Species[dat$Countries==x[2]])))
counts
X1 X2 count
1 Tanzania Kenya 2
2 Tanzania Italia 0
3 Tanzania Niger 1
4 Tanzania France 0
5 Kenya Italia 0
6 Kenya Niger 1
7 Kenya France 0
8 Italia Niger 0
9 Italia France 1
10 Niger France 0
CodePudding user response:
This is harder than it looks, and hopefully someone has a more elegant solution:
df2 <- expand.grid(country2 = sort(unique(df$Countries)),
country1 = sort(unique(df$Countries)))[2:1]
cp <- crossprod(table(df$Species, df$Countries))
df2$common <- c(cp)
df2 <- df2[c(lower.tri(cp)),]
df2
#> country1 country2 common
#> 2 France Italia 1
#> 3 France Kenya 0
#> 4 France Niger 0
#> 5 France Tanzania 0
#> 8 Italia Kenya 0
#> 9 Italia Niger 0
#> 10 Italia Tanzania 0
#> 14 Kenya Niger 1
#> 15 Kenya Tanzania 2
#> 20 Niger Tanzania 1
Created on 2022-04-13 by the reprex package (v2.0.1)
CodePudding user response:
Another base R option using crossprod
as.table
as.data.frame
m <- crossprod(table(df))
m[upper.tri(m, diag = TRUE)] <- NA
na.omit(as.data.frame(as.table(m)))
gives
Countries Countries.1 Freq
2 Italia France 1
3 Kenya France 0
4 Niger France 0
5 Tanzania France 0
8 Kenya Italia 0
9 Niger Italia 0
10 Tanzania Italia 0
14 Niger Kenya 1
15 Tanzania Kenya 2
20 Tanzania Niger 1