I have the following two-column data frame. I want to split the text column into two columns at the first occurrence of a phrase, in this case "unit #2". The new column 2 should hold the sentences before the first "unit #2", and the new column 3 should hold the sentences starting with "unit #2".
report <- data.frame(Text = c("unit #1 stopped at a stop sign on a road. unit #1 was speeding. unit #2 travelling southbound in lane #2 of 3 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was unmindful.",
"unit #1 stopped there. unit #1 was under influence of drug. unit #2 travelling northbound. unit #2 was not unmindful. unit #2 crashed into unit #1.",
"unit #1 was going straight. unit #1 was not speeding. unit #2 travelling southbound in lane #1 of 2 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was driving fast."), id = 1:3)
CodePudding user response:
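You could use tidyr's extract() with a non-greedy regex. The first capture group matches lazily, and its lookahead stops it just before the first "unit #2"; the second group captures everything from there to the end of the string. (Add remove = FALSE if you want to keep column 1.)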
library(tidyverse)
report <- data.frame(Text = c(
"unit #1 stopped at a stop sign on a road. unit #1 was speeding. unit #2 travelling southbound in lane #2 of 3 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was unmindful.",
"unit #1 stopped there. unit #1 was under influence of drug. unit #2 travelling northbound. unit #2 was not unmindful. unit #2 crashed into unit #1.",
"unit #1 was going straight. unit #1 was not speeding. unit #2 travelling southbound in lane #1 of 2 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was driving fast."
), id = 1:3)
df <- report |>
  # Group 1: everything before the first "unit #2"; group 2: the rest
  extract(Text, into = c("column 2", "column 3"), regex = "(.*?(?=unit #2))(.*)")
df
#> column 2
#> 1 unit #1 stopped at a stop sign on a road. unit #1 was speeding.
#> 2 unit #1 stopped there. unit #1 was under influence of drug.
#> 3 unit #1 was going straight. unit #1 was not speeding.
#> column 3
#> 1 unit #2 travelling southbound in lane #2 of 3 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was unmindful.
#> 2 unit #2 travelling northbound. unit #2 was not unmindful. unit #2 crashed into unit #1.
#> 3 unit #2 travelling southbound in lane #1 of 2 lanes. unit #2 couldn't react in time and crashed into unit #1. unit #2 was driving fast.
#> id
#> 1 1
#> 2 2
#> 3 3
Created on 2022-06-14 by the reprex package (v2.0.1)
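If you would rather not load the tidyverse, the same split can be sketched in base R with regexpr() and substr(). This is only a rough alternative, not the approach used above: split_at_first is a hypothetical helper name, the column names before and after are made up for illustration, and the handling of rows without the marker (whole text kept in the first column, NA in the second) is an assumption about what you would want. It also trims the trailing space from the first part.

```r
# Base R sketch: split each string at the first occurrence of a marker.
# `split_at_first` is a hypothetical helper, not part of any package.
split_at_first <- function(x, marker = "unit #2") {
  # Position of the first literal match; -1 where the marker is absent.
  # as.integer() drops the extra attributes that regexpr() attaches.
  pos <- as.integer(regexpr(marker, x, fixed = TRUE))
  found <- pos > 0
  # Assumption: rows without the marker keep their full text in `before`.
  before <- ifelse(found, trimws(substr(x, 1, pos - 1)), x)
  after <- ifelse(found, substr(x, pos, nchar(x)), NA_character_)
  data.frame(before = before, after = after, stringsAsFactors = FALSE)
}

txt <- c(
  "unit #1 stopped there. unit #2 travelling northbound. unit #2 crashed into unit #1.",
  "unit #1 was parked."
)
res <- split_at_first(txt)
res
```

With the question's data you would call split_at_first(report$Text) and cbind() the result to report$id. Because fixed = TRUE treats the marker as a literal string, characters such as "#" or "." in it need no escaping.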