Home > database >  Create new column based on presence/absence of string in other column by group
Create new column based on presence/absence of string in other column by group

Time:05-07

I have this dataset about vessels locations, where the same "id" can correspond to two levels. Corresponds to a defined category, such as "fishing" and may also appear as "unspecified". I would like to create a new column, where whenever an "id" appears as "unspecified" and also as another category (in other rows), "unspecified" is replaced by that category.

#dataset example
library(dplyr)

levels <- c("passenger", "passenger", "unspecified", "passenger", "passenger",
            "passenger", "passenger", "passenger", "passenger", "passenger",
            "unspecified", "passenger", "fishing", "unspecified", "fishing", 
            "fishing", "fishing","unspecified", "fishing", "fishing", 
            "unspecified","fishing", "fishing", "fishing", "unspecified",
            "unspecified", "unspecified")
id <- c("844", "844", "844", "844", "844","844", "844", "844", "844", "844",
        "844", "844", "845", "845", "845", "845", "845","845", "845", "845", 
        "845","845", "845", "845", "825", "825", "825")
lat <- c(-30.6456, -29.5648, -27.6667, -31.5587, -30.6934, -29.3147, -23.0538, 
         -26.5877, -26.6923, -23.40865, -23.1143, -23.28331, -31.6456, -24.5648, 
         -27.6867, -31.4587, -30.6784, -28.3447, -23.0466, -27.5877, -26.8524, 
         -23.8855, -24.1143, -23.5874, -23.5259, -22.8788, -22.1324)
long <- c(-50.4879, -49.8715, -51.8716, -50.4456, -50.9842, -51.9787, -41.2343, 
          -40.2859, -40.19599, -41.64302, -41.58042, -41.55057, -50.4576, -48.8715, 
          -51.4566, -51.4456, -50.4477, -50.9937, -41.4789, -41.3859, -40.2536, 
          -41.6502, -40.5442, -41.4057, -40.4058, -42.4877, -41.4545)

df <- tibble(levels = as.factor(levels), id  = as.factor(id), lat, long)

Here is my expected output:

> output %>% print(n = 27)
# A tibble: 27 x 5
   levels      id      lat  long new_colum  
   <fct>       <fct> <dbl> <dbl> <fct>      
 1 passenger   844   -30.6 -50.5 passenger  
 2 passenger   844   -29.6 -49.9 passenger  
 3 unspecified 844   -27.7 -51.9 passenger  
 4 passenger   844   -31.6 -50.4 passenger  
 5 passenger   844   -30.7 -51.0 passenger  
 6 passenger   844   -29.3 -52.0 passenger  
 7 passenger   844   -23.1 -41.2 passenger  
 8 passenger   844   -26.6 -40.3 passenger  
 9 passenger   844   -26.7 -40.2 passenger  
10 passenger   844   -23.4 -41.6 passenger  
11 unspecified 844   -23.1 -41.6 passenger  
12 passenger   844   -23.3 -41.6 passenger  
13 fishing     845   -31.6 -50.5 fishing    
14 unspecified 845   -24.6 -48.9 fishing    
15 fishing     845   -27.7 -51.5 fishing    
16 fishing     845   -31.5 -51.4 fishing    
17 fishing     845   -30.7 -50.4 fishing    
18 unspecified 845   -28.3 -51.0 fishing    
19 fishing     845   -23.0 -41.5 fishing    
20 fishing     845   -27.6 -41.4 fishing    
21 unspecified 845   -26.9 -40.3 fishing    
22 fishing     845   -23.9 -41.7 fishing    
23 fishing     845   -24.1 -40.5 fishing    
24 fishing     845   -23.6 -41.4 fishing    
25 unspecified 825   -23.5 -40.4 unspecified
26 unspecified 825   -22.9 -42.5 unspecified
27 unspecified 825   -22.1 -41.5 unspecified

It's for when the same id is unspecified but it's also belongs some other level. When it's just unspecified it stays that way, when it also belongs to other level replaces unspecified with this one.

CodePudding user response:

You can replace "unspecified" with NA, fill these NA with the previous value in the same id group, and then replace the remaining NA back to "unspecified".

library(tidyverse)

df %>%
  mutate(new_colum = na_if(levels, "unspecified")) %>%
  group_by(id) %>% 
  fill(new_colum) %>%
  ungroup() %>%
  replace_na(list(new_colum = "unspecified"))
# A tibble: 27 × 5
   levels      id      lat  long new_colum  
   <fct>       <fct> <dbl> <dbl> <fct>      
 1 passenger   844   -30.6 -50.5 passenger  
 2 passenger   844   -29.6 -49.9 passenger  
 3 unspecified 844   -27.7 -51.9 passenger  
 4 passenger   844   -31.6 -50.4 passenger  
 5 passenger   844   -30.7 -51.0 passenger  
 6 passenger   844   -29.3 -52.0 passenger  
 7 passenger   844   -23.1 -41.2 passenger  
 8 passenger   844   -26.6 -40.3 passenger  
 9 passenger   844   -26.7 -40.2 passenger  
10 passenger   844   -23.4 -41.6 passenger  
11 unspecified 844   -23.1 -41.6 passenger  
12 passenger   844   -23.3 -41.6 passenger  
13 fishing     845   -31.6 -50.5 fishing    
14 unspecified 845   -24.6 -48.9 fishing    
15 fishing     845   -27.7 -51.5 fishing    
16 fishing     845   -31.5 -51.4 fishing    
17 fishing     845   -30.7 -50.4 fishing    
18 unspecified 845   -28.3 -51.0 fishing    
19 fishing     845   -23.0 -41.5 fishing    
20 fishing     845   -27.6 -41.4 fishing    
21 unspecified 845   -26.9 -40.3 fishing    
22 fishing     845   -23.9 -41.7 fishing    
23 fishing     845   -24.1 -40.5 fishing    
24 fishing     845   -23.6 -41.4 fishing    
25 unspecified 825   -23.5 -40.4 unspecified
26 unspecified 825   -22.9 -42.5 unspecified
27 unspecified 825   -22.1 -41.5 unspecified

CodePudding user response:

You can use an ifelse statement to specify condition where there is only one type of levels AND at the same time that levels equals to "unspecified", do not change these records. Otherwise, change it to another levels that is not "unspecified".

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(new_column = ifelse(n_distinct(levels) == 1 & levels == "unspecified",
                             as.character(levels), 
                             as.character(levels)[levels != "unspecified"]))

# A tibble: 27 × 5
# Groups:   id [3]
   levels      id      lat  long new_column 
   <fct>       <fct> <dbl> <dbl> <chr>      
 1 passenger   844   -30.6 -50.5 passenger  
 2 passenger   844   -29.6 -49.9 passenger  
 3 unspecified 844   -27.7 -51.9 passenger  
 4 passenger   844   -31.6 -50.4 passenger  
 5 passenger   844   -30.7 -51.0 passenger  
 6 passenger   844   -29.3 -52.0 passenger  
 7 passenger   844   -23.1 -41.2 passenger  
 8 passenger   844   -26.6 -40.3 passenger  
 9 passenger   844   -26.7 -40.2 passenger  
10 passenger   844   -23.4 -41.6 passenger  
11 unspecified 844   -23.1 -41.6 passenger  
12 passenger   844   -23.3 -41.6 passenger  
13 fishing     845   -31.6 -50.5 fishing    
14 unspecified 845   -24.6 -48.9 fishing    
15 fishing     845   -27.7 -51.5 fishing    
16 fishing     845   -31.5 -51.4 fishing    
17 fishing     845   -30.7 -50.4 fishing    
18 unspecified 845   -28.3 -51.0 fishing    
19 fishing     845   -23.0 -41.5 fishing    
20 fishing     845   -27.6 -41.4 fishing    
21 unspecified 845   -26.9 -40.3 fishing    
22 fishing     845   -23.9 -41.7 fishing    
23 fishing     845   -24.1 -40.5 fishing    
24 fishing     845   -23.6 -41.4 fishing    
25 unspecified 825   -23.5 -40.4 unspecified
26 unspecified 825   -22.9 -42.5 unspecified
27 unspecified 825   -22.1 -41.5 unspecified 

CodePudding user response:

I believe this is the simplest/most intuitive solution:

The case_when statement checks each group - if all levels are "unspecified", then new_column becomes "unspecified", otherwise if it contains either "fishing" or "passenger", it becomes one of those.

library(dplyr)
df %>%
    group_by(id) %>%
    mutate(new_column = case_when(
        all(levels == "unspecified") ~ "unspecified",
        any(levels == "fishing") ~ "fishing",
        any(levels == "passenger") ~ "passenger"
        )
    )
# A tibble: 27 × 5
# Groups:   id [3]
   levels      id      lat  long new_column 
   <fct>       <fct> <dbl> <dbl> <chr>      
 1 passenger   844   -30.6 -50.5 passenger  
 2 passenger   844   -29.6 -49.9 passenger  
 3 unspecified 844   -27.7 -51.9 passenger  
 4 passenger   844   -31.6 -50.4 passenger  
 5 passenger   844   -30.7 -51.0 passenger  
 6 passenger   844   -29.3 -52.0 passenger  
 7 passenger   844   -23.1 -41.2 passenger  
 8 passenger   844   -26.6 -40.3 passenger  
 9 passenger   844   -26.7 -40.2 passenger  
10 passenger   844   -23.4 -41.6 passenger  
11 unspecified 844   -23.1 -41.6 passenger  
12 passenger   844   -23.3 -41.6 passenger  
13 fishing     845   -31.6 -50.5 fishing    
14 unspecified 845   -24.6 -48.9 fishing    
15 fishing     845   -27.7 -51.5 fishing    
16 fishing     845   -31.5 -51.4 fishing    
17 fishing     845   -30.7 -50.4 fishing    
18 unspecified 845   -28.3 -51.0 fishing    
19 fishing     845   -23.0 -41.5 fishing    
20 fishing     845   -27.6 -41.4 fishing    
21 unspecified 845   -26.9 -40.3 fishing    
22 fishing     845   -23.9 -41.7 fishing    
23 fishing     845   -24.1 -40.5 fishing    
24 fishing     845   -23.6 -41.4 fishing    
25 unspecified 825   -23.5 -40.4 unspecified
26 unspecified 825   -22.9 -42.5 unspecified
27 unspecified 825   -22.1 -41.5 unspecified
  • Related