I have a data frame like the one below:
year district
2017 arrah
2017 buxar
2017 rohtas
2018 rohtas
2018 arwal
2018 seohar
2019 nawda
2019 buxar
2019 jamui
I want to subset the data so that a district that already appeared in an earlier year does not appear again in 2018 or 2019, as shown below:
year district
2017 arrah
2017 buxar
2017 rohtas
2018 arwal
2018 seohar
2019 nawda
2019 jamui
I have tried the anti_join function, but it does not solve my problem.
CodePudding user response:
dplyr
library(dplyr)
quux %>%
group_by(district) %>%
slice_min(year) %>%
ungroup()
# # A tibble: 7 x 2
# year district
# <int> <chr>
# 1 2017 arrah
# 2 2018 arwal
# 3 2017 buxar
# 4 2019 jamui
# 5 2019 nawda
# 6 2017 rohtas
# 7 2018 seohar
base R
quux[ave(quux$year, quux$district, FUN = function(y) y == min(y)) > 0, ]
# year district
# 1 2017 arrah
# 2 2017 buxar
# 3 2017 rohtas
# 5 2018 arwal
# 6 2018 seohar
# 7 2019 nawda
# 9 2019 jamui
Data
quux <- structure(list(year = c(2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L), district = c("arrah", "buxar", "rohtas", "rohtas", "arwal", "seohar", "nawda", "buxar", "jamui")), class = "data.frame", row.names = c(NA, -9L))
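Another base R idea, offered only as a sketch and not as a replacement for the code above: sort by year and drop every district that has already been seen, using duplicated(). The names sorted and first_seen are made up for this illustration. Unlike the min(year) comparison, this keeps exactly one row per district, so it would also drop a second row for the same district within the same year.

```r
# Same data as in the question, rebuilt here so the snippet runs on its own.
quux <- data.frame(
  year = c(2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

# Sort by year so the earliest occurrence of each district comes first.
sorted <- quux[order(quux$year), ]

# duplicated() flags every repeat after the first occurrence; keep the rest.
first_seen <- sorted[!duplicated(sorted$district), ]
print(first_seen)
```

This relies on the rows being ordered by year before duplicated() is applied, which is why the explicit order() step is there even though the sample data is already sorted.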
CodePudding user response:
We can try the following data.table option (df is the data frame from the question):
> library(data.table)
> setcolorder(setDT(df)[, .SD[which.min(year)], district], "year")[]
year district
1: 2017 arrah
2: 2017 buxar
3: 2017 rohtas
4: 2018 arwal
5: 2018 seohar
6: 2019 nawda
7: 2019 jamui
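For anyone who wants to run this without df already defined, here is a self-contained sketch of the same idea in a slightly different form. It is an alternative formulation, not the code above: it rebuilds the question's data under the hypothetical name dt, orders by year, and keeps the first row per district with unique() and its by argument.

```r
library(data.table)

# The question's data, rebuilt so the snippet runs on its own.
dt <- data.table(
  year = c(2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

# Order by year, then keep only the first row seen for each district.
res <- unique(dt[order(year)], by = "district")
print(res)
```

Because the column order is never changed here, no setcolorder() call is needed.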
CodePudding user response:
Here is an example that keeps, for every district, the record with the earliest year. You can substitute another summarise (aggregate) function if needed:
library(dplyr)
df <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)
df %>% group_by(district) %>%
summarise(year = min(year))
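Note that summarise() returns one row per district with district as the first column, so the result does not have the layout asked for in the question. A rough sketch of how that could be adjusted, and of what swapping the aggregate function looks like, is below. The names earliest and latest are made up for this illustration.

```r
library(dplyr)

df <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

# Earliest year per district, with year first and rows sorted by year.
earliest <- df %>%
  group_by(district) %>%
  summarise(year = min(year)) %>%
  select(year, district) %>%
  arrange(year)

# Swapping in another aggregate, here max(), gives the latest year instead.
latest <- df %>%
  group_by(district) %>%
  summarise(year = max(year))

print(earliest)
print(latest)
```

Keep in mind that summarise() drops any columns that are not grouped or aggregated, so if the real data has more columns, a row-preserving approach such as slice_min() from the first answer may be a better fit.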