I have a dataframe containing strings with various formats (words, cases, special characters, spaces, hyphen, overlapping words). These were selected by surveyors from a pre-defined list. But the surveyor could select multiple terms for each row. What I would like to do is recognise where these pre-defined terms occur and place them on a new row and duplicate all other columns. Some example data where the pre-defined terms are different pressures:
Pre_defined_pressures <- c("Urbanisation", "Land cover (general)", "Agriculture / Horticulture", "Water Quality", "General-effects")
Surveyor_df <- as.data.frame(c("A","B","C","D"))
colnames(Surveyor_df ) <- "ID"
Surveyor_df$Year <- c(1999,1999,2000,2000)
Surveyor_df$Pressure <- c("Urbanisation Land cover (general)",
"Urbanisation Land cover (general) Agriculture / Horticulture General-effects",
"Urbanisation Land cover (general) Water Quality General-effects",
"Urbanisation Land cover (general) Agriculture / Horticulture Water Quality General-effects")
How the correct data frame should look.
Correct_df <- as.data.frame(c("A","A","B","B","B","B","C","C","C","C","D","D","D","D","D"))
colnames(Correct_df ) <- "ID"
Correct_df$Year <- c(1999,1999,1999,1999,1999,1999,2000,2000,2000,2000,2000,2000,2000,2000,2000)
Correct_df$Pressure <- c("Urbanisation","Land cover (general)",
"Urbanisation","Land cover (general)","Agriculture / Horticulture","General-effects",
"Urbanisation", "Land cover (general)", "Water Quality", "General-effects",
"Urbanisation", "Land cover (general)", "Agriculture / Horticulture", "Water Quality","General-effects")
CodePudding user response:
One way of doing this is extracting each possible value into a separate column and then pivotting longer. This would be easier if the values were separated by a unique character, but in your data a space can either be a lexical space or a separator.
library(dplyr)
library(stringr)
library(tidyr)
Surveyor_df |>
mutate(urb = str_extract(Pressure, "Urbanisation"),
lan = str_extract(Pressure, "Land cover \\(general\\)"),
agr = str_extract(Pressure, "Agriculture / Horticulture"),
wat = str_extract(Pressure, "Water Quality")
) |>
pivot_longer(cols = urb:wat, values_to = "pressure") |>
select(-Pressure, -name) |>
filter(!is.na(pressure))
Output:
#> # A tibble: 12 x 3
#> ID Year pressure
#> <chr> <dbl> <chr>
#> 1 A 1999 Urbanisation
#> 2 A 1999 Land cover (general)
#> 3 B 1999 Urbanisation
#> 4 B 1999 Land cover (general)
#> 5 B 1999 Agriculture / Horticulture
#> 6 C 2000 Urbanisation
#> 7 C 2000 Land cover (general)
#> 8 C 2000 Water Quality
#> 9 D 2000 Urbanisation
#> 10 D 2000 Land cover (general)
#> 11 D 2000 Agriculture / Horticulture
#> 12 D 2000 Water Quality
CodePudding user response:
Define a named vector with comma as a separator mark (could be any symbol). Replace existing Pressure
values, and remove trailing comma, then create separate rows for each instance using tidy::separate_rows
. Then some tidying up of text.
This method avoids manually typing out values, which may be a pain if you have a lot of different values for Pressure
.
Updated following OP's comment and revised input dataset.
library(tidyr)
library(dplyr)
library(stringr)
vec_p <- paste0(Pre_defined_pressures, ",")
names(vec_p) <- Pre_defined_pressures
Correct_df <-
Surveyor_df %>%
mutate(Pressure = str_replace_all(Pressure, coll(vec_p)),
Pressure = str_remove(Pressure, ",$")) %>%
separate_rows(Pressure, sep = ",") %>%
mutate(Pressure = str_squish(Pressure))
Correct_df
#> # A tibble: 15 × 3
#> ID Year Pressure
#> <chr> <dbl> <chr>
#> 1 A 1999 Urbanisation
#> 2 A 1999 Land cover (general)
#> 3 B 1999 Urbanisation
#> 4 B 1999 Land cover (general)
#> 5 B 1999 Agriculture / Horticulture
#> 6 B 1999 General-effects
#> 7 C 2000 Urbanisation
#> 8 C 2000 Land cover (general)
#> 9 C 2000 Water Quality
#> 10 C 2000 General-effects
#> 11 D 2000 Urbanisation
#> 12 D 2000 Land cover (general)
#> 13 D 2000 Agriculture / Horticulture
#> 14 D 2000 Water Quality
#> 15 D 2000 General-effects
Created on 2022-05-24 by the reprex package (v2.0.1)
CodePudding user response:
One way in base might be to find the matches using grepl
and then expand the Surveyor_df using col
and the Pre_defined_pressures using row
.
. <- t(sapply(Pre_defined_pressures, grepl, Surveyor_df$Pressure, fixed=TRUE))
. <- cbind(Surveyor_df[col(.)[.], 1:2], Pressure = Pre_defined_pressures[row(.)[.]])
all.equal(., Correct_df, check.attributes = FALSE)
#[1] TRUE
.
# ID Year Pressure
#1 A 1999 Urbanisation
#1.1 A 1999 Land cover (general)
#2 B 1999 Urbanisation
#2.1 B 1999 Land cover (general)
#2.2 B 1999 Agriculture / Horticulture
#2.3 B 1999 General-effects
#3 C 2000 Urbanisation
#3.1 C 2000 Land cover (general)
#3.2 C 2000 Water Quality
#3.3 C 2000 General-effects
#4 D 2000 Urbanisation
#4.1 D 2000 Land cover (general)
#4.2 D 2000 Agriculture / Horticulture
#4.3 D 2000 Water Quality
#4.4 D 2000 General-effects