I have a data frame ordered by id variables ("city"), and I want to keep the second observation of those cities that have more than one observation.
For example, here's an example data set:
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
Then we have:
city value
1 1 3
2 1 5
3 2 7
4 3 8
5 3 2
6 4 5
7 5 4
8 6 2
9 7 3
10 7 2
11 8 3
The ideal outcome would be:
city value
2 1 5
3 2 7
5 3 2
6 4 5
7 5 4
8 6 2
10 7 2
11 8 3
Any help is appreciated!
CodePudding user response:
dplyr
library(dplyr)
mydata %>%
group_by(city) %>%
filter(n() == 1L | row_number() == 2L) %>%
ungroup()
# # A tibble: 8 x 2
# city value
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 2
# 4 4 5
# 5 5 4
# 6 6 2
# 7 7 2
# 8 8 3
or slightly different
mydata %>%
group_by(city) %>%
slice(min(n(), 2)) %>%
ungroup()
base R
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
# [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
mydata[ind,]
# city value
# 2 1 5
# 3 2 7
# 5 3 2
# 6 4 5
# 7 5 4
# 8 6 2
# 10 7 2
# 11 8 3
data.table
Since you mentioned "is way bigger", you might consider data.table
at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
# city value
# <num> <num>
# 1: 1 5
# 2: 2 7
# 3: 3 2
# 4: 4 5
# 5: 5 4
# 6: 6 2
# 7: 7 2
# 8: 8 3
CodePudding user response:
Here is logic that uses pmin()
to choose either 2 or 1 depending on the length of the vector of value
-values:
aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
The aggregate function delivers vectors of value
split on the basis of city
-values.
CodePudding user response:
You may try
library(dplyr)
mydata %>%
group_by(city) %>%
filter(case_when(n()> 1 ~ row_number() == 2,
TRUE ~ row_number()== 1))
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
CodePudding user response:
Another dplyr
solution:
mydata %>% group_by(city) %>%
summarize(value=value[pmin(2, n())])
Or:
mydata %>% group_by(city) %>%
summarize(value=ifelse(n() >= 2, value[2], value[1]))
Both Output:
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
CodePudding user response:
If base R is ok try this:
EDIT (since performance really seems to be important):
Using if
as a function, should give a 100-fold speed-up in some cases.
aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
CodePudding user response:
Combining group_split
and map_if
:
library(tidyverse)
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
value2 <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
group_split(mydata, city) %>%
map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#> city value
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 7
#> 3 3 2
#> 4 4 5
#> 5 5 4
#> 6 6 2
#> 7 7 2
#> 8 8 3
Created on 2021-11-30 by the reprex package (v2.0.1)