I'm trying to split a text variable that goes like this:
text = "name name name (1235-23-532)"
to something like this:
name = "name name name"
num = "1235-23-532"
I'm trying this code:
df_split <- df %>%
  separate(owners,
           into = c("name", "num"),
           sep = "(?<=[A-Za-z])(?=\\()"
  )
However, the number part comes out as NA. I'm confused about why it doesn't detect the parenthesis (I tried both ( and \( and neither works). Is there a good solution for this?
Also: some rows have two pairs of parentheses, like "name name name (name) (number)". Is there a good way to extract just the numbers?
Thank you very much.
CodePudding user response:
Here is one way to get your desired output:
library(tidyverse)
as_tibble(text) %>%
  mutate(name = str_trim(gsub("[^a-zA-Z]", " ", value)),
         num = str_extract(value, '\\d+\\-\\d+\\-\\d+'), .keep = "unused")
# A tibble: 1 x 2
  name           num
  <chr>          <chr>
1 name name name 1235-23-532
OR:
library(tidyverse)
as_tibble(text) %>%
  separate(value, c("name", "num"), sep = ' \\(') %>%
  mutate(num = str_remove(num, '\\)'))
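As for why the original pattern gives NA: the lookbehind (?<=[A-Za-z]) requires a letter immediately before the "(", but in the sample text a space sits between the last letter and the parenthesis, so the separator never matches. For the rows with two pairs of parentheses, here is a minimal sketch that assumes the number is always in the last pair of parentheses and consists only of digits and hyphens. The sample data frame is made up for illustration; owners is the column name from the question. Note that this version drops the "(name)" part from the name column, which may or may not be what you want.

```r
library(tidyverse)

# Hypothetical sample data covering both row shapes from the question.
df <- tibble(owners = c("name name name (1235-23-532)",
                        "name name name (name) (1235-23-532)"))

df_split <- df %>%
  mutate(
    # Digits and hyphens enclosed by the final pair of parentheses.
    num  = str_extract(owners, "(?<=\\()[0-9-]+(?=\\)\\s*$)"),
    # Everything before the first opening parenthesis, trimmed.
    name = str_trim(str_remove(owners, "\\(.*$"))
  )

df_split
```

Rows where the last parenthesized group is not made of digits and hyphens get NA in num, which makes them easy to find with filter(is.na(num)).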
CodePudding user response:
I don't have a way to prevent the "NA", but I do have a workaround I use when I run into this problem: I use fct_recode() inside mutate() to recode "NA" to the proper value (reference).
For example:
%>% mutate(Column_Name = fct_recode(Column_Name, "new_name" = "NA"))
This works for me. It's not perfect, but it fixes the problem.