Home > OS >  How to repeat query on different parts of a dataset in R?
How to repeat query on different parts of a dataset in R?

Time:05-26

I want to repeat a particular query on a large dataset and I am sure the answer to my question is quite basic, but after reading various sources on 'for' loops, repeat and replicate functions for about 2 hours, I still can't find any examples which appear to do what I need to do.

The dataset contains survey data from particular sites which are split into plots and each plot contains multiple species entries so the data looks like this:

SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e

The actual data is over 6500 rows as there are hundreds of sites and each should contain 20 plots - the issue is some plots are missing from some sites, so what I need to do is establish how many plots are missing in total. I can use the following code to query how many unique plots are on each site so in the example below I query how many unique plots are in site number 7:

NROW(unique(df$PLOT[df$SITE=="7"]))

   [20]

But I have hundreds of sites, so is there a function that will allow me to query each site automatically without manually changing the site number each time?

CodePudding user response:

Here is a base R way with tapply.

x <- '
SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e'
df1 <- read.table(textConnection(x), header = TRUE)

num_plots <- with(df1, tapply(PLOT, SITE, \(x) length(unique(x))))
which(num_plots != max(num_plots))
#> 2 
#> 2

Created on 2022-05-26 by the reprex package (v2.0.1)

CodePudding user response:

Not quite sure what you're going for but does this help?

Using data.table:

df <- read.table(text='SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e', header=TRUE)
library(data.table)
setDT(df)[, .(plots=uniqueN(PLOT)), by=.(SITE)]
##    SITE plots
## 1:    1     5
## 2:    2     4
  • Related