Newbie question! I have a column of strings that come in two different fixed-width formats. The format can be recognized from its name at the start of the string, and the string should then be split according to that format.
df <- data.frame(
var1 = c('M1B123456789MM1158','M1C123456789zMM1183'),
var2 = c('code1','code8'))
The fixed-width formats are:
formatM1B = c(3,9,2,4)
formatM1C = c(3,9,1,2,4)
So I am hoping for this result:
 |format|var1_2   |var1_3|var1_5|var1_6|code |
1|M1B   |123456789|      |MM    |1158  |code1|
2|M1C   |123456789|z     |MM    |1183  |code8|
I tried the functions separate, str_split and str_split_fixed, but I don't know how to combine them with some sort of IF function to "test" or "regex" the format mentioned in the string. This question has certainly been asked many times, but after hours of research I could not find anything I could adapt to my data :/
CodePudding user response:
library(tidyverse)
df %>%
extract(col = var1, into = c('format','1','2','3','4'),
regex = "^(M[0-9][A-Z])([0-9]{9})(z)?(M{2})([0-9]{4})")
The regular expression has 5 groups:
- (M[0-9][A-Z]): matches an M, one digit, and one uppercase letter
- ([0-9]{9}): matches exactly 9 digits
- (z)?: matches a z if there is one, otherwise the group is skipped
- (M{2}): matches 2 M's
- ([0-9]{4}): matches exactly 4 digits
Note that [0-9] is used rather than [1-9] so that values containing a 0 also match.
Output:
  format         1 2  3    4  var2
1    M1B 123456789   MM 1158 code1
2    M1C 123456789 z MM 1183 code8
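To see what each capture group picks up without loading the tidyverse, here is a minimal base R sketch using regexec() and regmatches(). It uses [0-9] for the digits so zeros are allowed too; the names `x`, `pattern` and `parts` are only illustrative and not part of the answer above.

```r
x <- c("M1B123456789MM1158", "M1C123456789zMM1183")

# Same five groups as described above, anchored at both ends
pattern <- "^(M[0-9][A-Z])([0-9]{9})(z)?(M{2})([0-9]{4})$"

# regmatches() returns, per string, the full match followed by each group
m <- regmatches(x, regexec(pattern, x))

# Drop the full match and stack the groups into a character matrix
parts <- do.call(rbind, lapply(m, function(v) v[-1]))
colnames(parts) <- c("format", "1", "2", "3", "4")
parts
```

This is only a way to inspect the groups; extract() remains the more convenient choice when the result should stay inside a data frame pipeline.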
CodePudding user response:
Here is a function that does the splitting based on your formatM1B/C
vectors,
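f1 <- function(string, vec){
  # start of each field: 1, then one past the end of the previous field
  start <- c(1, cumsum(vec)[-length(vec)] + 1)
  # end of each field: running total of the widths
  end <- cumsum(vec)
  # cut the string once per (start, end) pair
  apply(data.frame(start, end), 1, function(i) substring(string, i[1], i[2]))
}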
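The start/end arithmetic can be checked by hand on the first format. The sketch below works through the widths c(3, 9, 2, 4) from the question; since substring() is vectorized over its start and end arguments, it can be called directly here. The variable names are illustrative.

```r
vec <- c(3, 9, 2, 4)               # formatM1B from the question

end <- cumsum(vec)                 # last position of each field: 3 12 14 18
start <- c(1, head(end, -1) + 1)   # first position of each field: 1 4 13 15

fields <- substring("M1B123456789MM1158", start, end)
fields
# "M1B" "123456789" "MM" "1158"
```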
And we can apply it as,
Map(function(x, y) f1(x, y), df$var1,list(formatM1B, formatM1C))
#$M1B123456789MM1158
#[1] "M1B" "123456789" "MM" "1158"
#$M1C123456789zMM1183
#[1] "M1C" "123456789" "z" "MM" "1183"
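Note that the Map() call pairs each string with a format by position, so it relies on the rows being in the same order as the list of formats. A possible way to pick the format from the 3-character prefix instead, and to get the blank var1_3 column from the question, is sketched below. The list `formats`, the helper `split_one` and the idea of naming each width by its target column are assumptions made for illustration, not something from the answers above. A prefix that is not in `formats` is not handled specially here and simply gives a row of blanks.

```r
df <- data.frame(
  var1 = c("M1B123456789MM1158", "M1C123456789zMM1183"),
  var2 = c("code1", "code8"),
  stringsAsFactors = FALSE
)

# Widths per format, each width named by the column it should land in
formats <- list(
  M1B = c(format = 3, var1_2 = 9, var1_5 = 2, var1_6 = 4),
  M1C = c(format = 3, var1_2 = 9, var1_3 = 1, var1_5 = 2, var1_6 = 4)
)
cols <- c("format", "var1_2", "var1_3", "var1_5", "var1_6")

split_one <- function(string) {
  widths <- formats[[substr(string, 1, 3)]]      # look up widths by prefix
  end <- cumsum(widths)
  start <- end - widths + 1
  parts <- substring(string, start, end)
  names(parts) <- names(widths)
  out <- setNames(rep("", length(cols)), cols)   # blank where a format has no such field
  out[names(parts)] <- parts
  out
}

# One row per input string, one column per field
res <- as.data.frame(
  t(vapply(df$var1, split_one, character(length(cols)))),
  stringsAsFactors = FALSE
)
rownames(res) <- NULL
res$code <- df$var2
res
```

Naming the widths keeps the columns aligned across formats, which a purely positional split cannot do when one format has an extra field in the middle.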