Extract string within the parenthesis in R

Time:01-06

I need to extract some information from a string.

Here is an example dataset

data <- data.frame(id = c(1,2),
                  text = c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
                           "G1_Conciencia fonologica (FSL)_Form_Number_3.csv"))

> data
  id                                             text
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv

Basically, I need to extract the text inside the parentheses and the numerical value after Form_Number.

How can I get the desired output below?

  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3

CodePudding user response:

A solution using gsub

library(dplyr)

data %>% 
  mutate(cat = gsub(".*\\(|\\).*", "", text), 
         form = gsub(".*Form_Number_|\\.csv$", "", text))
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3
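
Note that gsub returns character vectors, so form above is a character column even though it prints like a number. As a minimal sketch of the same idea in base R only (not part of the original answer), sub with a capture group and a "\\1" backreference keeps just the captured piece, and as.integer converts the form number:

```r
data <- data.frame(id = c(1, 2),
                   text = c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
                            "G1_Conciencia fonologica (FSL)_Form_Number_3.csv"))

# Keep only the capture group: the text between the parentheses.
data$cat <- sub(".*\\(([^)]+)\\).*", "\\1", data$text)

# Keep only the digits after "Form_Number_" and convert them to integer.
data$form <- as.integer(sub(".*Form_Number_(\\d+)\\.csv$", "\\1", data$text))

data
```

This assumes every file name contains one parenthesised code and ends in "_Form_Number_<digits>.csv", as in the example data.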

CodePudding user response:

Using str_extract

library(dplyr)
library(stringr)
# The group argument needs a recent stringr (1.5.0 or later).
data %>% 
    mutate(cat = str_extract(text, "\\(([^)]+)", group = 1),
    form = as.integer(str_extract(text, "Number_(\\d+)", group = 1)))

-output

  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3

Or with extract

library(tidyr)
 extract(data, text, into = c("cat", "form"), 
    ".*\\(([^)]+).*_Number_(\\d+)\\..*", remove = FALSE, 
    convert = TRUE)
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3

Or using base R

cbind(data, strcapture(".*\\(([^)]+)\\)_Form_Number_(\\d+)\\..*", 
   data$text, data.frame(cat = character(), form = integer())))
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3
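
One difference between these approaches that may matter on messier data is what happens when a file name does not match the pattern. The sketch below uses a hypothetical second file name (not from the question) to show it: gsub leaves a non-matching string unchanged, while strcapture fills the row with NA.

```r
x <- c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
       "no_parentheses_here.csv")  # hypothetical non-matching file name

# gsub: nothing to replace in the second element, so it comes back as-is.
cat_gsub <- gsub(".*\\(|\\).*", "", x)

# strcapture: rows that do not match the regex get NA in every column.
res <- strcapture(".*\\(([^)]+)\\)_Form_Number_(\\d+)\\..*", x,
                  data.frame(cat = character(), form = integer()))

cat_gsub
res
```

If non-matching names are possible, the NA behaviour is usually easier to detect afterwards than a silently unchanged string.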