I need to extract some information from a string.
Here is an example dataset:
data <- data.frame(id = c(1,2),
text = c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
"G1_Conciencia fonologica (FSL)_Form_Number_3.csv"))
> data
id text
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv
Basically, I need to extract the text inside the parentheses and the numerical value after Form_Number.
How can I get the desired output below?
id text cat form
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS 1
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL 3
CodePudding user response:
A solution using gsub
library(dplyr)
data %>%
mutate(cat = gsub(".*\\(|\\).*", "", text),
form = gsub(".*Form_Number_|\\.csv$", "", text))
id text cat form
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS 1
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL 3
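One caveat with the gsub approach: when a string does not match the pattern, gsub returns it unchanged rather than NA, and form stays a character column. Below is a minimal base R sketch of one way to get NA for non-matching rows. The helper name capture_first and the third, non-matching row are made up for illustration and are not part of the original data.

```r
# Hypothetical helper (not from the answer above): returns the first capture
# group of `pattern` for each element of `x`, or NA when there is no match.
capture_first <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m,
         function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Same two rows as in the question, plus an invented row that does not match.
data <- data.frame(id = c(1, 2, 3),
                   text = c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
                            "G1_Conciencia fonologica (FSL)_Form_Number_3.csv",
                            "no_match_here.csv"))

# Text inside the parentheses, and the digits after Form_Number_ as integer.
data$cat  <- capture_first(data$text, "\\(([^)]+)\\)")
data$form <- as.integer(capture_first(data$text, "Form_Number_(\\d+)"))
```

With this sketch the third row gets NA in both cat and form instead of the full file name.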
CodePudding user response:
Using str_extract
library(dplyr)
library(stringr)
data %>%
mutate(cat = str_extract(text, "\\(([^)]+)", group = 1),
form = as.integer(str_extract(text, "Number_(\\d+)", group = 1)))
-output
id text cat form
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS 1
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL 3
Or with extract
library(tidyr)
extract(data, text, into = c("cat", "form"),
".*\\(([^)]+).*_Number_(\\d+)\\..*", remove = FALSE,
convert = TRUE)
id text cat form
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS 1
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL 3
Or using base R
cbind(data, strcapture(".*\\(([^)]+)\\)_Form_Number_(\\d+)\\..*",
data$text, data.frame(cat = character(), form = integer())))
id text cat form
1 1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS 1
2 2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL 3