Home > Net >  Selecting rows from a data.table using an integer value not working
Selecting rows from a data.table using an integer value not working

Time:11-07

I have large numbers of biological recording visits (c350K - hence use of data.table), the details of which include the date and coordinates. I want to lookup the weather at the time and place of each visit. Such data are available from MetOffice gridded daily weather data in which the values for each month are held in a .nc files. Therefore, I extract the year and month from the date, loop though the visits by year and month, load the relevant .nc file (using raster and ncdf4 packages) for each year and month and then extract the data for the appropriate days and coordinates. However, I have run into a problem - selecting visits by year and month is not working.

Here is some example code illustrating the problem:

library(data.table)
set.seed(21)
v <- data.table(dt=sample(seq(as.Date('2010/01/01'), 
                              as.Date('2020/12/31'), by="day"), 50),
                x=sample(seq(100,600),50),
                y=sample(seq(0,1000),50))

v[ ,year:=as.integer(format(dt, "%Y"))]
v[ ,mon:=as.integer(format(dt, "%m"))]
v[ ,day:=as.integer(format(dt, "%d"))]

for(y in min(v$year):max(v$year)){
  for(m in 1:12){
    x <- v[year==y & mon==m, ]
    if(nrow(x)>0){
      print(paste(y,m,nrow(x)))
    }
  }
}

The result is - nothing is printed. x <- v[year==y & mon==m, ] does not select any rows. To illustrate this further, consider the following:

> nrow(v[year==2020 & mon==9, ])
[1] 2
> y<-2020
> m<-9
> nrow(v[year==y & mon==m, ])
[1] 0
>

Examining the data.table, the year, mon and day variables are "int" and so are variables y and m. I don't understand what is going on here!

CodePudding user response:

When evaluating symbols in the expression, data.table first looks in the data and then, if not found, looks in the enclosing environment. In your first expression, year == y, it finds both year and y in the local data (columns 4 and 3, respectively), so it uses both of them. Of course, none of your y frame values are the same as the year.

There are several alternatives, more so found in the not-yet-released data.table-1.14.5 and its env= argument. Lacking that, see that

y <- 2020
m <- 9
v[year==2020 & mon==9,]
#            dt     x     y  year   mon   day
#        <Date> <int> <int> <int> <int> <int>
# 1: 2020-09-01   549   196  2020     9     1
# 2: 2020-09-18   525     2  2020     9    18
v[year==y & mon==m,]
# Empty data.table (0 rows and 6 cols): dt,x,y,year,mon,day
v[, .SD[year==..y & mon==m,]]
#            dt     x     y  year   mon   day
#        <Date> <int> <int> <int> <int> <int>
# 1: 2020-09-01   549   196  2020     9     1
# 2: 2020-09-18   525     2  2020     9    18

The ..y trick works in the j= (second) position, not the i= (first), so the hack of using .SD[..].

CodePudding user response:

Apart from r2evans approach we can always use get(), but we need to specify the envir argument correctly.

library(data.table)

set.seed(21)

v <- data.table(dt=sample(seq(as.Date('2010/01/01'), 
                              as.Date('2020/12/31'), by="day"), 50),
                x=sample(seq(100,600),50),
                y=sample(seq(0,1000),50))

v[ ,year:=as.integer(format(dt, "%Y"))]
v[ ,mon:=as.integer(format(dt, "%m"))]
v[ ,day:=as.integer(format(dt, "%d"))]

for(y in min(v$year):max(v$year)) {
  for(m in 1:12) {
    x <- v[year == get("y", envir = globalenv()) & mon == get("m", envir = globalenv()), ]
    if(nrow(x)>0){
      print(paste(y,m,nrow(x)))
    }
  }
}

#> [1] "2010 6 1"
#> [1] "2011 1 2"
#> [1] "2011 2 1"
#> [1] "2011 5 1"
#> [1] "2011 8 1"
#> [1] "2011 11 1"
#> [1] "2012 1 1"
#> [1] "2012 2 1"
#> [1] "2012 8 1"
#> [1] "2013 12 2"
#> [1] "2014 1 2"
#> [1] "2014 6 1"
#> [1] "2014 7 2"
#> [1] "2014 8 3"
#> [1] "2014 9 1"
#> [1] "2014 12 1"
#> [1] "2015 1 1"
#> [1] "2015 2 1"
#> [1] "2015 3 1"
#> [1] "2015 10 1"
#> [1] "2016 4 1"
#> [1] "2016 6 1"
#> [1] "2016 8 1"
#> [1] "2016 10 1"
#> [1] "2017 1 2"
#> [1] "2017 4 1"
#> [1] "2017 6 1"
#> [1] "2017 8 1"
#> [1] "2017 11 2"
#> [1] "2018 2 1"
#> [1] "2018 6 1"
#> [1] "2018 9 1"
#> [1] "2018 10 1"
#> [1] "2018 12 1"
#> [1] "2020 1 1"
#> [1] "2020 2 1"
#> [1] "2020 4 1"
#> [1] "2020 5 1"
#> [1] "2020 6 1"
#> [1] "2020 9 2"
#> [1] "2020 10 1"

Created on 2022-11-06 with reprex v2.0.2

  • Related