I have a df with a bunch of rownames that represent variables in different zones. Currently, the variables and zones are together in the same text string, and inconsistently formatted.
I'm trying to use stringr
to separate the zone codes (alphanumeric) into a different column and eventually pivot and stuff, but I'm having trouble getting my strings to split at only the First occurrence of a number, instead of at every occurrence.
df <-
tribble(
~variable, ~value,
"First Variable 4W", 20,
"First Variable 4X5Y" , 22,
"First Variable 5ZE" , 17,
"First Variable 4X5Y" , 22,
"First Variable 4Vs W" , 22,
"Second Variable 4X", 50,
"Second Variable4XW", 50,
"Third Variable_4XW", 10,
"Third Variable 5ZE", 10
) %>%
as.data.frame()
# first tried with str_split_fixed, which separates correctly,
# but I don't know how to access both of the output columns.
# if you use glimpse(), you see that variable.1 and variable.2 are
# actually one column that's a matrix, so I'm confused.
df2 <- df %>%
mutate(variable = str_replace(variable,"_","")) %>%
mutate(variable =
str_split_fixed(variable,
pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
n=2))
# if I use separate, the splits get messed up on rows with merged zone values
# such as 4X5Y, and want to separate into 3 columns instead of two.
df3 <- df %>%
mutate(variable = str_replace(variable,"_","")) %>%
tidyr::separate(variable,
sep = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
fill = "right",
into = c("variable", "zone"))
Desired output is a split column where the first is the variable name string, and the second is the code text, all lumped as is into one column.
> df2
#desired output:
variable zone value
1 First Variable 4W 20
2 First Variable 4X5Y 22
3 First Variable 5ZE 17
4 First Variable 4X5Y 22
5 First Variable 4Vs W 22
6 Second Variable 4X 50
7 Second Variable 4XW 50
8 Third Variable 4XW 10
9 Third Variable 5ZE 10
(Note that this is what the output from str_split_fixed
looks like, but if you try to manipulate it, you see that in that case it actually comes out as a df of 2 columns, such that the first column is a 2x9 matrix. I want it to come out as 3 columns such that I can use df2$zone
or df2 %>% select(zone)
to call the second column - in the current form, that is not possible.
Thanks!
CodePudding user response:
Actually you've already solved the problem yourself. Here I just present a way to incorporate the str_split_fixed()
result back to your df
.
I'll first save the result of str_split_fixed()
into a variable, then extract the corresponding columns from this matrix to incorporate back into df
. This way, you can flexibly set the column names.
library(tidyverse)
df <-
tribble(
~variable, ~value,
"First Variable 4W", 20,
"First Variable 4X5Y" , 22,
"First Variable 5ZE" , 17,
"First Variable 4X5Y" , 22,
"First Variable 4Vs W" , 22,
"Second Variable 4X", 50,
"Second Variable4XW", 50,
"Third Variable_4XW", 10,
"Third Variable 5ZE", 10
)
split_matrix <- str_split_fixed(df$variable,
pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
n=2) %>%
gsub("_$", "", .)
df$variable <- split_matrix[, 1]
df$zone <- split_matrix[, 2]
Output
# A tibble: 9 x 3
variable value zone
<chr> <dbl> <chr>
1 First Variable 20 4W
2 First Variable 22 4X5Y
3 First Variable 17 5ZE
4 First Variable 22 4X5Y
5 First Variable 22 4Vs W
6 Second Variable 50 4X
7 Second Variable 50 4XW
8 Third Variable 10 4XW
9 Third Variable 10 5ZE
dim(df)
[1] 9 3
CodePudding user response:
A simple solution using regexpr
to find the position of the first numeric:
position_of_first_numeric <- regexpr("[0-9]", df$variable)
varname <- substr(df$variable, 1, position_of_first_numeric-1)
len_of_string <- lapply(df$variable, nchar)
code_text <- substr(df$variable, position_of_first_numeric, len_of_string)
In varname
I stored the part of the string before the first numeric, and in code_text
the part after the first numeric, including the numeric.
Edit
To remove spaces, underscores and non-alphanumeric characters from varname
:
varname <- gsub("\\W $|_", "", varname)
Edit 2
Now it seems to me best to use tidyr::unnest_wider()
after replacing str_replace_fixed
by str_replace
to get lists in the varname
column:
newdf <- df %>%
mutate(variable = str_replace(variable,"_","")) %>%
mutate(variable =
str_split(variable,
pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
n=2)) %>%
unnest_wider(variable)
And then set names as desired.