Home > front end >  Dplyr summarize "sum" function works correctly only for subset not the larger dataset in R
Dplyr summarize "sum" function works correctly only for subset not the larger dataset in R

Time:04-20

I have a dataset where I sampled abundance of 4 species across 12 months, 6 sites (5 replicates within a site). I am trying to calculate various summary stats (mean and sum) at the within site replicates.

For instance, in the dataset shows (df), in April -- Site #1, Species "A" shown, I want the sum of sp1.abundance at the site/month/year. The sum is clearly close ~ 6

In test#1 When I use the complete dataset across all 12 months, sites etc; that value (sum.N) turns out to be 26.

In test#2, When I use the subset i.e. only month "Apr" and species "A", the sum is correct.

I run into similar problems for sum.nimi

I am unable to understand why there is a discrepancy. A reproducible example here is not the most helpful. I have a copy of the dataset here

df <- read.csv(file = "test1.csv") 

   year month site species sp1.abundance species2 sp2.abundance   sp1.sp2
1     2   Apr    1       A      3.000343  sp2.3.4      9.026369 27.082200
2     2   Apr    1       A      0.000000  sp2.3.4      0.000000  0.000000
3     2   Apr    1       A      0.000000  sp2.3.4      3.803798  0.000000
4     2   Apr    1       A      3.668878  sp2.3.4      3.388376 12.431541
5     2   Apr    1       A      0.000000  sp2.3.4      1.625994  0.000000
6     2   Apr    2       A      0.000000  sp2.3.4      8.308789  0.000000
7     2   Apr    2       A      1.029419  sp2.3.4      5.064659  5.213657
8     2   Apr    2       A      5.092077  sp2.3.4      1.002332  5.103951
9     2   Apr    2       A      0.000000  sp2.3.4      1.242110  0.000000
10    2   Apr    2       A      3.057688  sp2.3.4      1.427517  4.364901
11    2   Apr    3       A      5.450493  sp2.3.4      1.804613  9.836028
12    2   Apr    3       A      0.000000  sp2.3.4      1.998532  0.000000
13    2   Apr    3       A      4.799339  sp2.3.4      1.067276  5.122217
14    2   Apr    3       A      3.819911  sp2.3.4      4.503342 17.202368
15    2   Apr    3       A      0.000000  sp2.3.4      6.483943  0.000000
16    2   Apr    4       A      5.363487  sp2.3.4      1.543770  8.279988
17    2   Apr    4       A      5.062166  sp2.3.4      4.222809 21.376561
18    2   Apr    4       A      0.000000  sp2.3.4      4.847236  0.000000
19    2   Apr    4       A      1.707061  sp2.3.4      0.000000  0.000000
20    2   Apr    4       A      0.000000  sp2.3.4      2.415285  0.000000

  df <- df %>% 
  group_by(year,month,site,species)%>% 
  dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))

    year month  site species mean.ni mean.mi sum.nimi sum.N
   <int> <chr> <int> <chr>     <dbl>   <dbl>    <dbl> <dbl>
 1     2 Apr       1 A         1.33    3.68     186.   26.7
 2     2 Apr       2 A         1.84    3.93     153.   36.7
 3     2 Apr       3 A         2.81    4.49     298.   56.3
 4     2 Apr       4 A         2.43    3.77     261.   48.5
 5     2 Apr       5 A         0       2.60       0     0  
 6     2 Apr       6 A         1.70    4.65     193.   34.1
 7     2 Aug       1 A         0       0.791      0     0  
 8     2 Aug       2 A         0       0.590      0     0  
 9     2 Aug       3 A         0.717   0.795     38.6  14.3
10     2 Aug       4 A         0       0.366      0     0  
11     2 Aug       5 A         0       0          0     0  
12     2 Aug       6 A         0       0          0     0  
13     1 Dec       1 A         0       1.29       0     0  
14     1 Dec       2 A         1.50    1.12      88.9  30.0
15     1 Dec       3 A         1.12    2.41      57.0  22.4
16     1 Dec       4 A         0.660   0.495     32.7  13.2
17     1 Dec       5 A         0       1.54       0     0  
18     1 Dec       6 A         2.28    3.44     291.   45.6
19     2 Feb       1 A         1.94    4.85     230.   38.9
20     2 Feb       2 A         1.62    3.06     180.   32.3

#Subset (test2)

 df <- read.csv(file = "test2.csv") 

       year month site species sp1.abundance species2 sp2.abundance   sp1.sp2
    1     2   Apr    1       A      3.000343  sp2.3.4      9.026369 27.082200
    2     2   Apr    1       A      0.000000  sp2.3.4      0.000000  0.000000
    3     2   Apr    1       A      0.000000  sp2.3.4      3.803798  0.000000
    4     2   Apr    1       A      3.668878  sp2.3.4      3.388376 12.431541
    5     2   Apr    1       A      0.000000  sp2.3.4      1.625994  0.000000
    6     2   Apr    2       A      0.000000  sp2.3.4      8.308789  0.000000
    7     2   Apr    2       A      1.029419  sp2.3.4      5.064659  5.213657
    8     2   Apr    2       A      5.092077  sp2.3.4      1.002332  5.103951
    9     2   Apr    2       A      0.000000  sp2.3.4      1.242110  0.000000
    10    2   Apr    2       A      3.057688  sp2.3.4      1.427517  4.364901
    11    2   Apr    3       A      5.450493  sp2.3.4      1.804613  9.836028
    12    2   Apr    3       A      0.000000  sp2.3.4      1.998532  0.000000
    13    2   Apr    3       A      4.799339  sp2.3.4      1.067276  5.122217
    14    2   Apr    3       A      3.819911  sp2.3.4      4.503342 17.202368
    15    2   Apr    3       A      0.000000  sp2.3.4      6.483943  0.000000
    16    2   Apr    4       A      5.363487  sp2.3.4      1.543770  8.279988
    17    2   Apr    4       A      5.062166  sp2.3.4      4.222809 21.376561
    18    2   Apr    4       A      0.000000  sp2.3.4      4.847236  0.000000
    19    2   Apr    4       A      1.707061  sp2.3.4      0.000000  0.000000
    20    2   Apr    4       A      0.000000  sp2.3.4      2.415285  0.000000

   df <- df %>% 
      group_by(year,month,site,species)%>% 
      dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))

    # A tibble: 6 × 8
    # Groups:   year, month, site [6]
       year month  site species mean.ni mean.mi sum.nimi sum.N
      <int> <chr> <int> <chr>     <dbl>   <dbl>    <dbl> <dbl>
    1     2 Apr       1 A          1.33    3.57     39.5  6.67
    2     2 Apr       2 A          1.84    3.41     14.7  9.18
    3     2 Apr       3 A          2.81    3.17     32.2 14.1 
    4     2 Apr       4 A          2.43    2.61     29.7 12.1 
    5     2 Apr       5 A          0       3.47      0    0   
    6     2 Apr       6 A          1.70    4.49     37.5  8.52

  [1]: https://i.stack.imgur.com/eCqBK.png
  [2]: https://i.stack.imgur.com/lSyeW.png
  [3]: https://i.stack.imgur.com/cJzAW.png

CodePudding user response:

I don't know how you processed your data, but it seems that you might somehow have made 5 copies of your species-1 abundance data. If I look at all of the entries from April/site 1/species A, the species-1 abundance column is {3.00, 0, 0, 3.67, 0} repeated 4 times (4*6.67 = 26.8; these values match your two discrepant values). There are a few other repeat patterns in the data (e.g. all columns except species2 are identical in rows {3, 13} and {9, 19}.

I suspect a messed-up merge at some point upstream.

library(tidyverse)
df1 <- read_csv(file = "test1.csv")
filter(df1, month == "Apr", site ==1, species == "A")
# A tibble: 20 × 8
    year month  site species sp1.abundance species2 sp2.abundance sp1.sp2
   <dbl> <chr> <dbl> <chr>           <dbl> <chr>            <dbl>   <dbl>
 1     2 Apr       1 A                3.00 sp2.3.4           9.03    27.1
 2     2 Apr       1 A                0    sp2.3.4           0        0  
 3     2 Apr       1 A                0    sp2.3.4           3.80     0  
 4     2 Apr       1 A                3.67 sp2.3.4           3.39    12.4
 5     2 Apr       1 A                0    sp2.3.4           1.63     0  
 6     2 Apr       1 A                3.00 sp3.4.1          12.0     36.1
 7     2 Apr       1 A                0    sp3.4.1           0        0  
 8     2 Apr       1 A                0    sp3.4.1           2.41     0  
 9     2 Apr       1 A                3.67 sp3.4.1           7.06    25.9
10     2 Apr       1 A                0    sp3.4.1           0        0  
11     2 Apr       1 A                3.00 sp1.2.4           7.64    22.9
12     2 Apr       1 A                0    sp1.2.4           0        0  
13     2 Apr       1 A                0    sp1.2.4           3.80     0  
14     2 Apr       1 A                3.67 sp1.2.4           3.67    13.5
15     2 Apr       1 A                0    sp1.2.4           1.63     0  
16     2 Apr       1 A                3.00 sp1.2.3           7.39    22.2
17     2 Apr       1 A                0    sp1.2.3           0        0  
18     2 Apr       1 A                0    sp1.2.3           1.39     0  
19     2 Apr       1 A                3.67 sp1.2.3           7.06    25.9
20     2 Apr       1 A                0    sp1.2.3           1.63     0  
  • Related