Im working with a dataset where some of the table elements are mixed. Ideally, rows from the location column containing the word "District", i.e., "District 1", "District 2" etc.., should be in a separate column from the location column.
A snippet of the dataset looks like this:
location iri length
<chr> <dbl> <dbl>
1 District 1 NA NA
2 S00180CB 4.60 32171
3 S00186CB 5.17 31286
4 S00188CB 6.02 4742
5 District 2 NA NA
6 S00136CB 4.91 5968
7 S00285CB 4.33 29340
8 S00288CB 10.9 141
9 S00289CB 4.41 9126
10 District 3 NA NA
11 S00231CB 4.34 6895
12 S00266CB 5.65 18985
13 S00381CB 4.39 13799
14 S00382CB 8.96 124
15 District 4 NA NA
16 S00303CB 4.11 79599
17 S00311CB 3.17 625
18 District 5 NA NA
19 S00163CB 3.49 17996
20 S00253CB 2.65 905
21 S00259CB 2.26 905
Code i tried:
library(tidyverse)
df |>
mutate(district = ifelse(str_detect(location, "District"),
rep_len("District 1", length.out = n()), "none")
I expected it to come out like this:
district location iri length
<chr> <chr> <dbl> <dbl>
1 District 1 S00180CB 4.60 32171
2 District 1 S00186CB 5.17 31286
3 District 1 S00188CB 6.02 4742
4 District 2 S00136CB 4.91 5968
5 District 2 S00285CB 4.33 29340
6 District 2 S00288CB 10.9 141
7 District 2 S00289CB 4.41 9126
8 District 3 S00231CB 4.34 6895
9 District 3 S00266CB 5.65 18985
10 District 3 S00381CB 4.39 13799
11 District 3 S00382CB 8.96 124
12 District 4 S00303CB 4.11 79599
13 District 4 S00311CB 3.17 625
14 District 5 S00163CB 3.49 17996
15 District 5 S00253CB 2.65 905
16 District 5 S00259CB 2.26 905
but the result did not come out as expected.
Is there a more efficient way to accomplish this?
CodePudding user response:
You could use mutate
to create the District
column, then fill
to replace the missing values, then omit any rows where values are NA
.
library(dplyr)
library(tidyr)
df1 %>%
mutate(District = ifelse(str_detect(location, "District"), location, NA)) %>%
fill(District) %>%
na.omit()
Result:
location iri length District
2 S00180CB 4.60 32171 District 1
3 S00186CB 5.17 31286 District 1
4 S00188CB 6.02 4742 District 1
6 S00136CB 4.91 5968 District 2
7 S00285CB 4.33 29340 District 2
8 S00288CB 10.90 141 District 2
9 S00289CB 4.41 9126 District 2
11 S00231CB 4.34 6895 District 3
12 S00266CB 5.65 18985 District 3
13 S00381CB 4.39 13799 District 3
14 S00382CB 8.96 124 District 3
16 S00303CB 4.11 79599 District 4
17 S00311CB 3.17 625 District 4
19 S00163CB 3.49 17996 District 5
20 S00253CB 2.65 905 District 5
21 S00259CB 2.26 905 District 5
The data. Note: named df1
not df
, which is bad practice.
df1 <- structure(list(location = c("District 1", "S00180CB", "S00186CB",
"S00188CB", "District 2", "S00136CB", "S00285CB", "S00288CB",
"S00289CB", "District 3", "S00231CB", "S00266CB", "S00381CB",
"S00382CB", "District 4", "S00303CB", "S00311CB", "District 5",
"S00163CB", "S00253CB", "S00259CB"), iri = c(NA, 4.6, 5.17, 6.02,
NA, 4.91, 4.33, 10.9, 4.41, NA, 4.34, 5.65, 4.39, 8.96, NA, 4.11,
3.17, NA, 3.49, 2.65, 2.26), length = c(NA, 32171L, 31286L, 4742L,
NA, 5968L, 29340L, 141L, 9126L, NA, 6895L, 18985L, 13799L, 124L,
NA, 79599L, 625L, NA, 17996L, 905L, 905L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21"))
CodePudding user response:
Here is an alternative dplyr
way with grouping and filtering:
library(dplyr)
df %>%
group_by(group = cumsum(grepl('District', location))) %>%
mutate(District = first(location)) %>%
filter(!grepl('District', location)) %>%
ungroup() %>%
select(-group)
location iri length District
<chr> <dbl> <int> <chr>
1 S00180CB 4.6 32171 District 1
2 S00186CB 5.17 31286 District 1
3 S00188CB 6.02 4742 District 1
4 S00136CB 4.91 5968 District 2
5 S00285CB 4.33 29340 District 2
6 S00288CB 10.9 141 District 2
7 S00289CB 4.41 9126 District 2
8 S00231CB 4.34 6895 District 3
9 S00266CB 5.65 18985 District 3
10 S00381CB 4.39 13799 District 3
11 S00382CB 8.96 124 District 3
12 S00303CB 4.11 79599 District 4
13 S00311CB 3.17 625 District 4
14 S00163CB 3.49 17996 District 5
15 S00253CB 2.65 905 District 5
16 S00259CB 2.26 905 District 5