Create a presence-absence matrix based on association data-CodePudding

I have referred to Create a presence/absence column based on presence records [duplicate], Create a presence-absence matrix with presence on specific dates, and Presence-absence matrix but continually run into the issue of my species-association columns.

Using a large, longitudinal dataset based on primate behavior, I have created a species/association table. I have an extra column, variable, likely due to my attempt at grouping "community_id". All reproducible subset of my dataset is seen below.

dput data -

    data <- structure(list(Species = c("BABO", "BW", "RC", "BW", "RC", "SKS", 
"SKS", "RC", "RC", "SKS", "BW", "RC", "RC", "RC", "RC", "SKS", 
"RC", "SKS", "SKS", "RC"), Association = c(NA, "SKS", NA, "RC", 
"BW", "SKS", NA, NA, NA, "BW", "SKS", NA, "SKS", "BW", "SKS", 
NA, NA, "SKS", NA, "MANG"), variable = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "community_id", class = "factor"), community_id = c("2007-4-16.C3", 
"2007-4-16.C3", "2007-4-16.C3", "2007-4-17.Mwani", "2007-4-17.Mwani", 
"2007-4-17.Mwani", "2007-4-17.Mwani", "2007-4-18.Sanje", "2007-4-18.Sanje", 
"2007-4-18.Sanje", "2007-4-18.Sanje", "2007-5-8.C3", "2007-5-9.Mwani", 
"2007-5-9.Mwani", "2007-5-9.Mwani", "2007-5-10.Sanje", "2007-5-10.Sanje", 
"2007-6-6.C3", "2007-6-6.C3", "2007-6-6.C3")), row.names = c(NA, 
20L), class = "data.frame")

Output -

   Species  Association  variable       community_id
   <chr>    <chr>        <chr>          <chr>
1   BABO    NA           community_id   2007-4-16.C3
2   BW      SKS          community_id   2007-4-16.C3
3   RC      NA           community_id   2007-4-16.C3
4   BW      RC           community_id   2007-4-17.Mwani
5   RC      BW           community_id   2007-4-17.Mwani
6   SKS     SKS          community_id   2007-4-17.Mwani
7   SKS     NA           community_id   2007-4-17.Mwani
8   RC      NA           community_id   2007-4-18.Sanje
9   RC      NA           community_id   2007-4-18.Sanje
10  SKS     BW           community_id   2007-4-18.Sanje
11  BW      SKS          community_id   2007-4-18.Sanje
12  RC      NA           community_id   2007-5-8.C3
13  RC      SKS          community_id   2007-5-9.Mwani
14  RC      BW           community_id   2007-5-9.Mwani
15  RC      SKS          community_id   2007-5-9.Mwani
16  SKS     NA           community_id   2007-5-10.Sanje
17  RC      NA           community_id   2007-5-10.Sanje
18  SKS     SKS          community_id   2007-6-6.C3
19  SKS     NA           community_id   2007-6-6.C3
20  RC      MANG         community_id   2007-6-6.C3

I need a presence-absense matrix grouped by the community_id. I tried to group by the column "community_id" already, which is where I believe I created the additional and seemingly irrelevant "variable" column. I am looking for the below output -

community_id         BABO    BW     RC     SKS    Mang
<chr>                <chr>   <chr>  <chr>  <chr>  <chr>
2007-4-16.C3         1       1       1      1      0
2007-4-17.Mwani      0       1       1      1      0
2007-4-18.Sanje      0       1       1      1      0
2007-5-8.C3          0       0       1      0      0 
2007-5-9.Mwani       0       1       1      1      0
2007-5-10.Sanje      0       0       1      1      0
2007-6-6.C3          0       0       1      1      1

Any advice or assistance is greatly appreciated! Have a lovely day.

CodePudding user response：

You could do:

library(dplyr)

data %>%
  group_by(community_id) %>%
  summarize(as_tibble(t(sapply(c("BABO", "BW", "RC", "SKS", "Mang"),
                               function(x) as.numeric(x %in% Species)))))
#> # A tibble: 7 x 6
#>   community_id     BABO    BW    RC   SKS  Mang
#>   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2007-4-16.C3        1     1     1     0     0
#> 2 2007-4-17.Mwani     0     1     1     1     0
#> 3 2007-4-18.Sanje     0     1     1     1     0
#> 4 2007-5-10.Sanje     0     0     1     1     0
#> 5 2007-5-8.C3         0     0     1     0     0
#> 6 2007-5-9.Mwani      0     0     1     0     0
#> 7 2007-6-6.C3         0     0     1     1     0

^{Created on 2023-01-19 with reprex v2.0.2}

CodePudding user response：

A solution using pivot_wider, combining variables Species and Association first.

library(tidyr)
library(dplyr)

rbind(as.matrix(data[, -2]), as.matrix(data[, -1])) %>%
  as_tibble() %>% 
  distinct() %>% 
  na.omit() %>% 
  pivot_wider(community_id, names_from=Species, values_from=Species, 
    values_fn=function(x) any(unique(x) == x) * 1, values_fill = 0)
# A tibble: 7 × 6
  community_id     BABO    BW    RC   SKS  MANG
  <chr>           <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-4-16.C3        1     1     1     1     0
2 2007-4-17.Mwani     0     1     1     1     0
3 2007-4-18.Sanje     0     1     1     1     0
4 2007-5-8.C3         0     0     1     0     0
5 2007-5-9.Mwani      0     1     1     1     0
6 2007-5-10.Sanje     0     0     1     1     0
7 2007-6-6.C3         0     0     1     1     1

CodePudding user response：

Basically you want to look if the complete species spc are uniquely %in% the communities.

ufun <- \(x) unique(na.omit(unlist(x)))  ## helper function
# spc <- ufun(data[1:2])  ## this might work on your complete data
spc <- c('BABO', 'BW', 'RC', 'SKS', 'Mang')  ## here hard coded

by(data[1:2], data$community_id, \(x) setNames( (spc %in% ufun(x)), spc)) |>
  do.call(what=rbind)
#                 BABO BW RC SKS Mang
# 2007-4-16.C3       1  1  1   1    0
# 2007-4-17.Mwani    0  1  1   1    0
# 2007-4-18.Sanje    0  1  1   1    0
# 2007-5-10.Sanje    0  0  1   1    0
# 2007-5-8.C3        0  0  1   0    0
# 2007-5-9.Mwani     0  1  1   1    0
# 2007-6-6.C3        0  0  1   1    0

CodePudding user response：

Base R solution

Make a table() of unique() combinations of community_id and Species:

table(unique(data[c("community_id", "Species")]))

                 Species
community_id      BABO BW RC SKS
  2007-4-16.C3       1  1  1   0
  2007-4-17.Mwani    0  1  1   1
  2007-4-18.Sanje    0  1  1   1
  2007-5-10.Sanje    0  0  1   1
  2007-5-8.C3        0  0  1   0
  2007-5-9.Mwani     0  0  1   0
  2007-6-6.C3        0  0  1   1

tidyverse solution

First subset to distinct() values of community_id and Species; then create a Present variable set to 1 for all observations; then pivot_longer(), using the values_fill arg to add 0s for community-species combinations that aren’t observed.

library(dplyr)
library(tidyr)

data %>%
  distinct(community_id, Species) %>%
  mutate(Present = 1L) %>%
  pivot_wider(
    names_from = Species,
    values_from = Present,
    values_fill = 0L
  )

# A tibble: 7 × 5
  community_id     BABO    BW    RC   SKS
  <chr>           <int> <int> <int> <int>
1 2007-4-16.C3        1     1     1     0
2 2007-4-17.Mwani     0     1     1     1
3 2007-4-18.Sanje     0     1     1     1
4 2007-5-8.C3         0     0     1     0
5 2007-5-9.Mwani      0     0     1     0
6 2007-5-10.Sanje     0     0     1     1
7 2007-6-6.C3         0     0     1     1