I have a dataset where I sampled abundance of 4 species across 12 months, 6 sites (5 replicates within a site). I am trying to calculate various summary stats (mean and sum) at the within site replicates.
For instance, in the dataset shows (df), in April -- Site #1, Species "A" shown, I want the sum of sp1.abundance
at the site/month/year. The sum is clearly close ~ 6
In test#1 When I use the complete dataset across all 12 months, sites etc; that value (sum.N) turns out to be 26.
In test#2, When I use the subset i.e. only month "Apr" and species "A", the sum is correct.
I run into similar problems for sum.nimi
I am unable to understand why there is a discrepancy. A reproducible example here is not the most helpful. I have a copy of the dataset here
df <- read.csv(file = "test1.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.68 186. 26.7
2 2 Apr 2 A 1.84 3.93 153. 36.7
3 2 Apr 3 A 2.81 4.49 298. 56.3
4 2 Apr 4 A 2.43 3.77 261. 48.5
5 2 Apr 5 A 0 2.60 0 0
6 2 Apr 6 A 1.70 4.65 193. 34.1
7 2 Aug 1 A 0 0.791 0 0
8 2 Aug 2 A 0 0.590 0 0
9 2 Aug 3 A 0.717 0.795 38.6 14.3
10 2 Aug 4 A 0 0.366 0 0
11 2 Aug 5 A 0 0 0 0
12 2 Aug 6 A 0 0 0 0
13 1 Dec 1 A 0 1.29 0 0
14 1 Dec 2 A 1.50 1.12 88.9 30.0
15 1 Dec 3 A 1.12 2.41 57.0 22.4
16 1 Dec 4 A 0.660 0.495 32.7 13.2
17 1 Dec 5 A 0 1.54 0 0
18 1 Dec 6 A 2.28 3.44 291. 45.6
19 2 Feb 1 A 1.94 4.85 230. 38.9
20 2 Feb 2 A 1.62 3.06 180. 32.3
#Subset (test2)
df <- read.csv(file = "test2.csv")
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
1 2 Apr 1 A 3.000343 sp2.3.4 9.026369 27.082200
2 2 Apr 1 A 0.000000 sp2.3.4 0.000000 0.000000
3 2 Apr 1 A 0.000000 sp2.3.4 3.803798 0.000000
4 2 Apr 1 A 3.668878 sp2.3.4 3.388376 12.431541
5 2 Apr 1 A 0.000000 sp2.3.4 1.625994 0.000000
6 2 Apr 2 A 0.000000 sp2.3.4 8.308789 0.000000
7 2 Apr 2 A 1.029419 sp2.3.4 5.064659 5.213657
8 2 Apr 2 A 5.092077 sp2.3.4 1.002332 5.103951
9 2 Apr 2 A 0.000000 sp2.3.4 1.242110 0.000000
10 2 Apr 2 A 3.057688 sp2.3.4 1.427517 4.364901
11 2 Apr 3 A 5.450493 sp2.3.4 1.804613 9.836028
12 2 Apr 3 A 0.000000 sp2.3.4 1.998532 0.000000
13 2 Apr 3 A 4.799339 sp2.3.4 1.067276 5.122217
14 2 Apr 3 A 3.819911 sp2.3.4 4.503342 17.202368
15 2 Apr 3 A 0.000000 sp2.3.4 6.483943 0.000000
16 2 Apr 4 A 5.363487 sp2.3.4 1.543770 8.279988
17 2 Apr 4 A 5.062166 sp2.3.4 4.222809 21.376561
18 2 Apr 4 A 0.000000 sp2.3.4 4.847236 0.000000
19 2 Apr 4 A 1.707061 sp2.3.4 0.000000 0.000000
20 2 Apr 4 A 0.000000 sp2.3.4 2.415285 0.000000
df <- df %>%
group_by(year,month,site,species)%>%
dplyr::summarise(mean.ni = mean(sp1.abundance), mean.mi = mean(sp2.abundance), sum.nimi = sum(sp1.sp2), sum.N = sum(sp1.abundance))
# A tibble: 6 × 8
# Groups: year, month, site [6]
year month site species mean.ni mean.mi sum.nimi sum.N
<int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 Apr 1 A 1.33 3.57 39.5 6.67
2 2 Apr 2 A 1.84 3.41 14.7 9.18
3 2 Apr 3 A 2.81 3.17 32.2 14.1
4 2 Apr 4 A 2.43 2.61 29.7 12.1
5 2 Apr 5 A 0 3.47 0 0
6 2 Apr 6 A 1.70 4.49 37.5 8.52
[1]: https://i.stack.imgur.com/eCqBK.png
[2]: https://i.stack.imgur.com/lSyeW.png
[3]: https://i.stack.imgur.com/cJzAW.png
CodePudding user response:
I don't know how you processed your data, but it seems that you might somehow have made 5 copies of your species-1 abundance data. If I look at all of the entries from April/site 1/species A, the species-1 abundance column is {3.00, 0, 0, 3.67, 0} repeated 4 times (4*6.67 = 26.8; these values match your two discrepant values). There are a few other repeat patterns in the data (e.g. all columns except species2
are identical in rows {3, 13} and {9, 19}.
I suspect a messed-up merge at some point upstream.
library(tidyverse)
df1 <- read_csv(file = "test1.csv")
filter(df1, month == "Apr", site ==1, species == "A")
# A tibble: 20 × 8
year month site species sp1.abundance species2 sp2.abundance sp1.sp2
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 2 Apr 1 A 3.00 sp2.3.4 9.03 27.1
2 2 Apr 1 A 0 sp2.3.4 0 0
3 2 Apr 1 A 0 sp2.3.4 3.80 0
4 2 Apr 1 A 3.67 sp2.3.4 3.39 12.4
5 2 Apr 1 A 0 sp2.3.4 1.63 0
6 2 Apr 1 A 3.00 sp3.4.1 12.0 36.1
7 2 Apr 1 A 0 sp3.4.1 0 0
8 2 Apr 1 A 0 sp3.4.1 2.41 0
9 2 Apr 1 A 3.67 sp3.4.1 7.06 25.9
10 2 Apr 1 A 0 sp3.4.1 0 0
11 2 Apr 1 A 3.00 sp1.2.4 7.64 22.9
12 2 Apr 1 A 0 sp1.2.4 0 0
13 2 Apr 1 A 0 sp1.2.4 3.80 0
14 2 Apr 1 A 3.67 sp1.2.4 3.67 13.5
15 2 Apr 1 A 0 sp1.2.4 1.63 0
16 2 Apr 1 A 3.00 sp1.2.3 7.39 22.2
17 2 Apr 1 A 0 sp1.2.3 0 0
18 2 Apr 1 A 0 sp1.2.3 1.39 0
19 2 Apr 1 A 3.67 sp1.2.3 7.06 25.9
20 2 Apr 1 A 0 sp1.2.3 1.63 0