Home > Software engineering >  separate to split on first number only, not second
separate to split on first number only, not second

Time:02-24

I have a df with a bunch of rownames that represent variables in different zones. Currently, the variables and zones are together in the same text string, and inconsistently formatted.

I'm trying to use stringr to separate the zone codes (alphanumeric) into a different column and eventually pivot and stuff, but I'm having trouble getting my strings to split at only the First occurrence of a number, instead of at every occurrence.

df <- 
  tribble(
    ~variable,  ~value,
    "First Variable 4W",    20,
     "First Variable 4X5Y" , 22,
    "First Variable 5ZE" , 17,
    "First Variable 4X5Y" , 22,
    "First Variable 4Vs W" , 22,
    "Second Variable 4X",  50,
    "Second Variable4XW",  50,
    "Third Variable_4XW",  10,
    "Third Variable 5ZE",  10
  ) %>%
  as.data.frame()


# first tried with str_split_fixed, which separates correctly,
# but I don't know how to access both of the output columns.
# if you use glimpse(), you see that variable.1 and variable.2 are 
# actually one column that's a matrix, so I'm confused. 

df2 <- df %>%   
  mutate(variable = str_replace(variable,"_","")) %>%
  mutate(variable = 
           str_split_fixed(variable, 
                           pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
                           n=2))

# if I use separate, the splits get messed up on rows with merged zone values
# such as 4X5Y, and want to separate into 3 columns instead of two.
df3 <- df %>%
  mutate(variable = str_replace(variable,"_","")) %>%
  tidyr::separate(variable,
                  sep = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
                  fill = "right",
                  into = c("variable", "zone"))

Desired output is a split column where the first is the variable name string, and the second is the code text, all lumped as is into one column.

> df2
#desired output:
        variable        zone  value
1  First Variable         4W    20
2  First Variable       4X5Y    22
3  First Variable        5ZE    17
4  First Variable       4X5Y    22
5  First Variable      4Vs W    22
6 Second Variable         4X    50
7 Second Variable        4XW    50
8  Third Variable        4XW    10
9  Third Variable        5ZE    10

(Note that this is what the output from str_split_fixed looks like, but if you try to manipulate it, you see that in that case it actually comes out as a df of 2 columns, such that the first column is a 2x9 matrix. I want it to come out as 3 columns such that I can use df2$zone or df2 %>% select(zone) to call the second column - in the current form, that is not possible.

Thanks!

CodePudding user response:

Actually you've already solved the problem yourself. Here I just present a way to incorporate the str_split_fixed() result back to your df.

I'll first save the result of str_split_fixed() into a variable, then extract the corresponding columns from this matrix to incorporate back into df. This way, you can flexibly set the column names.

library(tidyverse)

df <- 
  tribble(
    ~variable,  ~value,
    "First Variable 4W",    20,
    "First Variable 4X5Y" , 22,
    "First Variable 5ZE" , 17,
    "First Variable 4X5Y" , 22,
    "First Variable 4Vs W" , 22,
    "Second Variable 4X",  50,
    "Second Variable4XW",  50,
    "Third Variable_4XW",  10,
    "Third Variable 5ZE",  10
  ) 

split_matrix <- str_split_fixed(df$variable, 
                pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
                n=2) %>% 
  gsub("_$", "", .)

df$variable <- split_matrix[, 1]
df$zone <- split_matrix[, 2]

Output

# A tibble: 9 x 3
  variable        value zone 
  <chr>           <dbl> <chr>
1 First Variable     20 4W   
2 First Variable     22 4X5Y 
3 First Variable     17 5ZE  
4 First Variable     22 4X5Y 
5 First Variable     22 4Vs W
6 Second Variable    50 4X   
7 Second Variable    50 4XW  
8 Third Variable     10 4XW  
9 Third Variable     10 5ZE  

dim(df)
[1] 9 3

CodePudding user response:

A simple solution using regexpr to find the position of the first numeric:

position_of_first_numeric <- regexpr("[0-9]", df$variable)

varname <- substr(df$variable, 1, position_of_first_numeric-1)

len_of_string <- lapply(df$variable, nchar)

code_text <- substr(df$variable, position_of_first_numeric, len_of_string)

In varname I stored the part of the string before the first numeric, and in code_text the part after the first numeric, including the numeric.

Edit

To remove spaces, underscores and non-alphanumeric characters from varname:

varname <- gsub("\\W $|_", "", varname)

Edit 2

Now it seems to me best to use tidyr::unnest_wider() after replacing str_replace_fixed by str_replace to get lists in the varname column:

newdf <- df %>%   
  mutate(variable = str_replace(variable,"_","")) %>%
  mutate(variable = 
           str_split(variable, 
                           pattern = "(?<=[a-zA-Z_])\\s*(?=[0-9])",
                           n=2)) %>% 
  unnest_wider(variable)

And then set names as desired.

  • Related