I have this dataset about vessels locations, where the same "id" can correspond to two levels. Corresponds to a defined category, such as "fishing" and may also appear as "unspecified". I would like to create a new column, where whenever an "id" appears as "unspecified" and also as another category (in other rows), "unspecified" is replaced by that category.
#dataset example
library(dplyr)
levels <- c("passenger", "passenger", "unspecified", "passenger", "passenger",
"passenger", "passenger", "passenger", "passenger", "passenger",
"unspecified", "passenger", "fishing", "unspecified", "fishing",
"fishing", "fishing","unspecified", "fishing", "fishing",
"unspecified","fishing", "fishing", "fishing", "unspecified",
"unspecified", "unspecified")
id <- c("844", "844", "844", "844", "844","844", "844", "844", "844", "844",
"844", "844", "845", "845", "845", "845", "845","845", "845", "845",
"845","845", "845", "845", "825", "825", "825")
lat <- c(-30.6456, -29.5648, -27.6667, -31.5587, -30.6934, -29.3147, -23.0538,
-26.5877, -26.6923, -23.40865, -23.1143, -23.28331, -31.6456, -24.5648,
-27.6867, -31.4587, -30.6784, -28.3447, -23.0466, -27.5877, -26.8524,
-23.8855, -24.1143, -23.5874, -23.5259, -22.8788, -22.1324)
long <- c(-50.4879, -49.8715, -51.8716, -50.4456, -50.9842, -51.9787, -41.2343,
-40.2859, -40.19599, -41.64302, -41.58042, -41.55057, -50.4576, -48.8715,
-51.4566, -51.4456, -50.4477, -50.9937, -41.4789, -41.3859, -40.2536,
-41.6502, -40.5442, -41.4057, -40.4058, -42.4877, -41.4545)
df <- tibble(levels = as.factor(levels), id = as.factor(id), lat, long)
Here is my expected output:
> output %>% print(n = 27)
# A tibble: 27 x 5
levels id lat long new_colum
<fct> <fct> <dbl> <dbl> <fct>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
It's for when the same id
is unspecified
but it's also belongs some other level. When it's just unspecified
it stays that way, when it also belongs to other level replaces unspecified
with this one.
CodePudding user response:
You can replace "unspecified"
with NA
, fill these NA
with the previous value in the same id
group, and then replace the remaining NA
back to "unspecified"
.
library(tidyverse)
df %>%
mutate(new_colum = na_if(levels, "unspecified")) %>%
group_by(id) %>%
fill(new_colum) %>%
ungroup() %>%
replace_na(list(new_colum = "unspecified"))
# A tibble: 27 × 5
levels id lat long new_colum
<fct> <fct> <dbl> <dbl> <fct>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
CodePudding user response:
You can use an ifelse
statement to specify condition where there is only one type of levels
AND at the same time that levels
equals to "unspecified", do not change these records. Otherwise, change it to another levels
that is not "unspecified".
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_column = ifelse(n_distinct(levels) == 1 & levels == "unspecified",
as.character(levels),
as.character(levels)[levels != "unspecified"]))
# A tibble: 27 × 5
# Groups: id [3]
levels id lat long new_column
<fct> <fct> <dbl> <dbl> <chr>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
CodePudding user response:
I believe this is the simplest/most intuitive solution:
The case_when
statement checks each group - if all levels
are "unspecified"
, then new_column
becomes "unspecified"
, otherwise if it contains either "fishing"
or "passenger"
, it becomes one of those.
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_column = case_when(
all(levels == "unspecified") ~ "unspecified",
any(levels == "fishing") ~ "fishing",
any(levels == "passenger") ~ "passenger"
)
)
# A tibble: 27 × 5
# Groups: id [3]
levels id lat long new_column
<fct> <fct> <dbl> <dbl> <chr>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified