This question relates directly to that question Extracting data from a single variable to multi variable- multi observations by symbols and text. [{..]} where I've been struggling to use extract
with an optional capture group (and have instead resorted to multiple targeted extract
operations).
Say I want to extract from this string:
dat <- '{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]'
five variables: "company", "user_name", "position", "duration", and "industry" but one of the strings does not have data for "industry". That variable, then, (I guess) will have to be defined as an optional capture group in the regex
argument. With this code, I feel I'm close - but apparently not close enough, as industry
returns empty strings for all three strings:
data.frame(dat) %>%
# simplify:
mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
# separate:
separate_rows(dat, sep = '(?<!^)(?=company)') %>%
extract(dat,
into = c("company", "user_name", "position", "duration", "industry"),
regex = 'company:([^,] ).*?url:([^,] ).*?\\btitle:([^,] ).*?duration:([^,] ).*?(?:industry:(.*))?')
# A tibble: 3 × 5
company user_name position duration industry
<chr> <chr> <chr> <chr> <chr>
1 Orange https://www.xyz CEO "Dec 2021 - Present 7 months " ""
2 Fig https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" ""
3 Papaya https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months" ""
How can this regex
be tweaked so that it captures the "industry" information where present?
CodePudding user response:
An alternate option would be to get the output in the long format first with separate_rows
, then split into two columns ('key', 'value'), and then use pivot_wider
to reshape back to wide
library(dplyr)
library(tidyr)
library(stringr)
data.frame(dat) %>%
mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
separate_rows(dat, sep = '(?<!^)(?=company)') %>%
mutate(rn = row_number()) %>%
separate_rows(dat, sep = ",") %>%
filter(str_detect(dat,
str_c(c("company", "url", "\\btitle", "duration", "industry"), collapse = "|"))) %>%
separate(dat, into = c("key", "value"), sep = ":",
fill = "right", extra = "merge") %>%
pivot_wider(names_from = key, values_from = value) %>%
select(company, user_name = url, position = positions,
duration, industry) %>%
mutate(position = trimws(position, whitespace =".*:"))
-output
# A tibble: 3 × 5
company user_name position duration industry
<chr> <chr> <chr> <chr> <chr>
1 Orange https://www.xyz CEO "Dec 2021 - Present 7 months " Non-profit Organizations
2 Fig https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" <NA>
3 Papaya https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months" High Tech
CodePudding user response:
You can use
company:([^,] ).*?url:([^,] ).*?\btitle:([^,] ).*?duration:([^,] )(?:.*?industry:([^,]*))?
See the regex demo. In R, do not forget to double escape the backslash.
The main point of interest here is (?:.*?industry:([^,]*))?
, that is an optional non-capturing group ((?:...)?
) thst matches one or zero occurrences of
.*?
- zero or more chars as few as possibleindustry:
- a literal string([^,]*)
- Group 5: any zero or more chars other than a comma.
R test:
library(tidyr)
dat <- '{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]'
data.frame(dat) %>%
# simplify:
mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
# separate:
separate_rows(dat, sep = '(?<!^)(?=company)') %>%
extract(dat,
into = c("company", "user_name", "position", "duration", "industry"),
regex = 'company:([^,] ).*?url:([^,] ).*?\\btitle:([^,] ).*?duration:([^,] )(?:.*?industry:([^,]*))?')
# A tibble: 3 x 5
company user_name position duration industry
<chr> <chr> <chr> <chr> <chr>
1 Orange https://www.xyz CEO "Dec 2021 - Present 7 months " "Non-profit Organizations"
2 Fig https://www.xyz2 Business Development Manager "Feb 2019 Dec 2021 2 years 11 months" ""
3 Papaya https://www.xyz3 Business Development Manager "Jan 2018 Oct 2018 10 months" "High Tech"
>