I have a dataframe in R and need to remove rows that do not follow an expected sequence in a column. A shortened version of my dataframe is as follows:
splits_level <- structure(list(name = c("1", "2", "3", "1", "2", "3", "1", "2",
"3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3",
"1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "4",
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1",
"2", "3", "4"), value = c(NA, NA, NA, "5", "4", "3", "00:01:35.780",
"00:03:12.220", "00:04:50.010", NA, NA, NA, "d500m", "d1000m",
"d1500m", "7cc15908-19a4-4e71-aa7a-8381000f47b5", "53b98dcd-f995-45a3-8803-395cdaedb4c2",
"8aedc73c-1780-4dc8-a2f8-4179c16e7b49", "7cc15908-19a4-4e71-aa7a-8381000f47b5",
"53b98dcd-f995-45a3-8803-395cdaedb4c2", "8aedc73c-1780-4dc8-a2f8-4179c16e7b49",
"31f1f791-977a-497d-9f38-540f66e54040", "58b439af-8221-43d2-81cd-9b21455441c1",
"c98a8ecc-9a58-40b1-8077-94df26507807", "40a17577-c7fd-4a69-b2a7-a95e28a186e6",
"40a17577-c7fd-4a69-b2a7-a95e28a186e6", "40a17577-c7fd-4a69-b2a7-a95e28a186e6",
"02c324d6-ec9f-4920-aeae-1416ae509f5f", "37f3526b-6ff9-495d-b8d3-5224330635fc",
"a0dfc090-93ab-443b-b764-9b596cace54f", NA, NA, NA, NA, "6",
"5", "2", "1", "00:01:35.930", "00:03:12.630", "00:04:49.950",
"00:06:27.120", NA, NA, NA, NA, "d500m", "d1000m", "d1500m",
"d2000m")), row.names = c(NA, -50L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to remove the rows that do not follow the sequence "1, 2, 3, 4" in the name column. In this example that is the first 30 rows, but it will not always be the first 30; the rows to drop could also be in the middle of the dataframe.
I am new to R and stuck with how to achieve this. Any help would be greatly appreciated! Thanks!
Answer:
Perhaps this?
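library(dplyr)

vec <- as.character(1:4)  # the expected sequence
splits_level %>%
  # every "1" starts a new group
  group_by(grp = cumsum(name == "1")) %>%
  # keep only groups that match vec exactly
  dplyr::filter(n() == length(vec) && all(name == vec)) %>%
  ungroup() %>%
  select(-grp)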
# # A tibble: 20 × 2
# name value
# <chr> <chr>
# 1 1 NA
# 2 2 NA
# 3 3 NA
# 4 4 NA
# 5 1 6
# 6 2 5
# 7 3 2
# 8 4 1
# 9 1 00:01:35.930
# 10 2 00:03:12.630
# 11 3 00:04:49.950
# 12 4 00:06:27.120
# 13 1 NA
# 14 2 NA
# 15 3 NA
# 16 4 NA
# 17 1 d500m
# 18 2 d1000m
# 19 3 d1500m
# 20 4 d2000m
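To see why the grouping works, it can help to look at what cumsum(name == "1") produces on a small vector. The following is only a minimal base R sketch on a made-up vector (nm, grp, ok and keep are illustrative names, not part of the original data or answer):

```r
# Hypothetical toy vector: two 3-row runs followed by one complete 4-row run
nm <- c("1", "2", "3", "1", "2", "3", "1", "2", "3", "4")

# Each "1" starts a new run, so the running count of "1"s acts as a group id
grp <- cumsum(nm == "1")
grp

# One TRUE/FALSE per group: does the run read exactly "1", "2", "3", "4"?
vec <- as.character(1:4)
ok <- tapply(nm, grp, function(x) identical(x, vec))

# Map the per-group result back onto the individual elements
keep <- as.vector(ok[as.character(grp)])
nm[keep]
```

Only the last run survives here, which is the same idea the dplyr pipeline applies per group with filter().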
Answer:
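In base R, you could identify the positions where name == "4" (using which), then Vectorize a user-defined function that sequences the row numbers leading up to each of those positions. Then simply index those row numbers in your original data. Note that this assumes every "4" is immediately preceded by the rows for 1, 2 and 3 of the same run: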
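# For each row index x of a "4", return the four row indices x-3, ..., x
vecfun <- Vectorize(function(x){
  x <- as.numeric(x)
  rev(seq(x, x-3, -1))
})
# vecfun() gives a matrix with one column per "4"; as.vector() flattens it column by column
splits_level[as.vector(vecfun(which(splits_level$name %in% "4"))),]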
Output:
# <chr> <chr>
# 1 1 NA
# 2 2 NA
# 3 3 NA
# 4 4 NA
# 5 1 6
# 6 2 5
# 7 3 2
# 8 4 1
# 9 1 00:01:35.930
# 10 2 00:03:12.630
# 11 3 00:04:49.950
# 12 4 00:06:27.120
# 13 1 NA
# 14 2 NA
# 15 3 NA
# 16 4 NA
# 17 1 d500m
# 18 2 d1000m
# 19 3 d1500m
# 20 4 d2000m
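If the data might contain a "4" that is not preceded by "1", "2", "3", the base R idea above could be made a little stricter by checking the rows before each "4" first. This is just a sketch under that assumption, run on a small made-up data frame (df, ends, ok and rows are illustrative names, not from the original post):

```r
# Hypothetical toy data: a stray "4" in row 1, one incomplete run, one full run
df <- data.frame(name  = c("4", "1", "2", "3", "1", "2", "3", "4"),
                 value = letters[1:8],
                 stringsAsFactors = FALSE)

vec  <- as.character(1:4)
ends <- which(df$name == "4")       # candidate end rows of a run
ends <- ends[ends >= length(vec)]   # need three rows before the "4"

# Keep an end row only if the four rows ending there read "1", "2", "3", "4"
ok <- vapply(ends, function(e) identical(df$name[(e - 3):e], vec), logical(1))

# Expand each valid end row into its four row indices and subset
rows <- unlist(lapply(ends[ok], function(e) (e - 3):e))
df[rows, ]
```

On this toy data only rows 5 to 8 are kept; the stray "4" in row 1 and the incomplete run in rows 2 to 4 are dropped.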