Use tidyr's function `extract` with optional capture group-CodePudding

This question relates directly to that question Extracting data from a single variable to multi variable- multi observations by symbols and text. [{..]} where I've been struggling to use extract with an optional capture group (and have instead resorted to multiple targeted extract operations).

Say I want to extract from this string:

dat <- '{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]'

five variables: "company", "user_name", "position", "duration", and "industry" but one of the strings does not have data for "industry". That variable, then, (I guess) will have to be defined as an optional capture group in the regexargument. With this code, I feel I'm close - but apparently not close enough, as industry returns empty strings for all three strings:

data.frame(dat) %>%
  # simplify:
  mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
  # separate:
  separate_rows(dat, sep = '(?<!^)(?=company)') %>%
  extract(dat,
          into = c("company", "user_name", "position", "duration", "industry"),
          regex = 'company:([^,] ).*?url:([^,] ).*?\\btitle:([^,] ).*?duration:([^,] ).*?(?:industry:(.*))?') 
# A tibble: 3 × 5
  company user_name        position                     duration                              industry
  <chr>   <chr>            <chr>                        <chr>                                 <chr>   
1 Orange  https://www.xyz  CEO                          "Dec 2021 - Present 7 months "        ""      
2 Fig     https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" ""      
3 Papaya  https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months"         ""

How can this regex be tweaked so that it captures the "industry" information where present?

CodePudding user response：

An alternate option would be to get the output in the long format first with separate_rows, then split into two columns ('key', 'value'), and then use pivot_wider to reshape back to wide

library(dplyr)
library(tidyr)
library(stringr)
 data.frame(dat) %>%  
  mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%  
  separate_rows(dat, sep = '(?<!^)(?=company)') %>%
  mutate(rn = row_number()) %>% 
  separate_rows(dat, sep = ",") %>% 
  filter(str_detect(dat,
    str_c(c("company", "url", "\\btitle", "duration", "industry"), collapse = "|"))) %>% 
  separate(dat, into = c("key", "value"), sep = ":",
      fill = "right", extra = "merge") %>% 
  pivot_wider(names_from = key, values_from = value) %>%  
  select(company, user_name = url, position = positions,
    duration, industry) %>% 
  mutate(position = trimws(position, whitespace =".*:"))

-output

# A tibble: 3 × 5
  company user_name        position                     duration                              industry                
  <chr>   <chr>            <chr>                        <chr>                                 <chr>                   
1 Orange  https://www.xyz  CEO                          "Dec 2021 - Present 7 months "        Non-profit Organizations
2 Fig     https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" <NA>                    
3 Papaya  https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months"         High Tech

CodePudding user response：

You can use

company:([^,] ).*?url:([^,] ).*?\btitle:([^,] ).*?duration:([^,] )(?:.*?industry:([^,]*))?

See the regex demo. In R, do not forget to double escape the backslash.

The main point of interest here is (?:.*?industry:([^,]*))?, that is an optional non-capturing group ((?:...)?) thst matches one or zero occurrences of

.*? - zero or more chars as few as possible
industry: - a literal string
([^,]*) - Group 5: any zero or more chars other than a comma.

R test:

library(tidyr)
dat <- '{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]'
data.frame(dat) %>%
    # simplify:
    mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
    # separate:
    separate_rows(dat, sep = '(?<!^)(?=company)') %>%
    extract(dat,
            into = c("company", "user_name", "position", "duration", "industry"),
            regex = 'company:([^,] ).*?url:([^,] ).*?\\btitle:([^,] ).*?duration:([^,] )(?:.*?industry:([^,]*))?')
# A tibble: 3 x 5
  company user_name        position                     duration                              industry                  
  <chr>   <chr>            <chr>                        <chr>                                 <chr>                     
1 Orange  https://www.xyz  CEO                          "Dec 2021 - Present 7 months "        "Non-profit Organizations"
2 Fig     https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" ""                        
3 Papaya  https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months"         "High Tech"               
>