I want to repeat a particular query on a large dataset and I am sure the answer to my question is quite basic, but after reading various sources on 'for' loops, repeat and replicate functions for about 2 hours, I still can't find any examples which appear to do what I need to do.
The dataset contains survey data from particular sites which are split into plots and each plot contains multiple species entries so the data looks like this:
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e
The actual data is over 6500 rows as there are hundreds of sites and each should contain 20 plots - the issue is some plots are missing from some sites, so what I need to do is establish how many plots are missing in total. I can use the following code to query how many unique plots are on each site so in the example below I query how many unique plots are in site number 7:
NROW(unique(df$PLOT[df$SITE=="7"]))
[20]
But I have hundreds of sites, so is there a function that will allow me to query each site automatically without manually changing the site number each time?
CodePudding user response:
Here is a base R way with tapply
.
x <- '
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e'
df1 <- read.table(textConnection(x), header = TRUE)
num_plots <- with(df1, tapply(PLOT, SITE, \(x) length(unique(x))))
which(num_plots != max(num_plots))
#> 2
#> 2
Created on 2022-05-26 by the reprex package (v2.0.1)
CodePudding user response:
Not quite sure what you're going for but does this help?
Using data.table
:
df <- read.table(text='SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e', header=TRUE)
library(data.table)
setDT(df)[, .(plots=uniqueN(PLOT)), by=.(SITE)]
## SITE plots
## 1: 1 5
## 2: 2 4