Creating new rows when specific terms within complex strings are recognised-CodePudding

I have a dataframe containing strings with various formats (words, cases, special characters, spaces, hyphen, overlapping words). These were selected by surveyors from a pre-defined list. But the surveyor could select multiple terms for each row. What I would like to do is recognise where these pre-defined terms occur and place them on a new row and duplicate all other columns. Some example data where the pre-defined terms are different pressures:

Pre_defined_pressures <- c("Urbanisation", "Land cover (general)", "Agriculture / Horticulture", "Water Quality", "General-effects")
Surveyor_df <- as.data.frame(c("A","B","C","D"))
colnames(Surveyor_df ) <- "ID"
Surveyor_df$Year <- c(1999,1999,2000,2000)
Surveyor_df$Pressure <- c("Urbanisation Land cover (general)", 
                      "Urbanisation Land cover (general) Agriculture / Horticulture General-effects",
                      "Urbanisation Land cover (general) Water Quality General-effects", 
                      "Urbanisation Land cover (general) Agriculture / Horticulture Water Quality General-effects")

How the correct data frame should look.

Correct_df <- as.data.frame(c("A","A","B","B","B","B","C","C","C","C","D","D","D","D","D"))
colnames(Correct_df ) <- "ID"
Correct_df$Year <- c(1999,1999,1999,1999,1999,1999,2000,2000,2000,2000,2000,2000,2000,2000,2000)
Correct_df$Pressure <- c("Urbanisation","Land cover (general)",
                     "Urbanisation","Land cover (general)","Agriculture / Horticulture","General-effects",
                     "Urbanisation", "Land cover (general)", "Water Quality", "General-effects",
                     "Urbanisation", "Land cover (general)", "Agriculture / Horticulture", "Water Quality","General-effects")

CodePudding user response：

One way of doing this is extracting each possible value into a separate column and then pivotting longer. This would be easier if the values were separated by a unique character, but in your data a space can either be a lexical space or a separator.

library(dplyr)
library(stringr)
library(tidyr)

Surveyor_df |> 
  mutate(urb = str_extract(Pressure, "Urbanisation"),
         lan = str_extract(Pressure, "Land cover \\(general\\)"),
         agr = str_extract(Pressure, "Agriculture / Horticulture"),
         wat = str_extract(Pressure, "Water Quality")
         ) |> 
  pivot_longer(cols = urb:wat, values_to = "pressure") |> 
  select(-Pressure, -name) |> 
  filter(!is.na(pressure))

Output:

#> # A tibble: 12 x 3
#>    ID     Year pressure                  
#>    <chr> <dbl> <chr>                     
#>  1 A      1999 Urbanisation              
#>  2 A      1999 Land cover (general)      
#>  3 B      1999 Urbanisation              
#>  4 B      1999 Land cover (general)      
#>  5 B      1999 Agriculture / Horticulture
#>  6 C      2000 Urbanisation              
#>  7 C      2000 Land cover (general)      
#>  8 C      2000 Water Quality             
#>  9 D      2000 Urbanisation              
#> 10 D      2000 Land cover (general)      
#> 11 D      2000 Agriculture / Horticulture
#> 12 D      2000 Water Quality

CodePudding user response：

Define a named vector with comma as a separator mark (could be any symbol). Replace existing Pressure values, and remove trailing comma, then create separate rows for each instance using tidy::separate_rows. Then some tidying up of text.

This method avoids manually typing out values, which may be a pain if you have a lot of different values for Pressure.

Updated following OP's comment and revised input dataset.


library(tidyr)
library(dplyr)
library(stringr)



vec_p <- paste0(Pre_defined_pressures, ",")

names(vec_p) <-   Pre_defined_pressures

Correct_df <- 
  Surveyor_df %>% 
  mutate(Pressure = str_replace_all(Pressure, coll(vec_p)),
         Pressure = str_remove(Pressure, ",$")) %>% 
  separate_rows(Pressure, sep = ",") %>% 
  mutate(Pressure = str_squish(Pressure))

Correct_df
#> # A tibble: 15 × 3
#>    ID     Year Pressure                  
#>    <chr> <dbl> <chr>                     
#>  1 A      1999 Urbanisation              
#>  2 A      1999 Land cover (general)      
#>  3 B      1999 Urbanisation              
#>  4 B      1999 Land cover (general)      
#>  5 B      1999 Agriculture / Horticulture
#>  6 B      1999 General-effects           
#>  7 C      2000 Urbanisation              
#>  8 C      2000 Land cover (general)      
#>  9 C      2000 Water Quality             
#> 10 C      2000 General-effects           
#> 11 D      2000 Urbanisation              
#> 12 D      2000 Land cover (general)      
#> 13 D      2000 Agriculture / Horticulture
#> 14 D      2000 Water Quality             
#> 15 D      2000 General-effects

^{Created on 2022-05-24 by the reprex package (v2.0.1)}

CodePudding user response：

One way in base might be to find the matches using grepl and then expand the Surveyor_df using col and the Pre_defined_pressures using row.

. <- t(sapply(Pre_defined_pressures, grepl, Surveyor_df$Pressure, fixed=TRUE))
. <- cbind(Surveyor_df[col(.)[.], 1:2], Pressure = Pre_defined_pressures[row(.)[.]])
all.equal(., Correct_df, check.attributes = FALSE)
#[1] TRUE
.
#    ID Year                   Pressure
#1    A 1999               Urbanisation
#1.1  A 1999       Land cover (general)
#2    B 1999               Urbanisation
#2.1  B 1999       Land cover (general)
#2.2  B 1999 Agriculture / Horticulture
#2.3  B 1999            General-effects
#3    C 2000               Urbanisation
#3.1  C 2000       Land cover (general)
#3.2  C 2000              Water Quality
#3.3  C 2000            General-effects
#4    D 2000               Urbanisation
#4.1  D 2000       Land cover (general)
#4.2  D 2000 Agriculture / Horticulture
#4.3  D 2000              Water Quality
#4.4  D 2000            General-effects