Home > Software engineering >  Split string without delimeters
Split string without delimeters

Time:04-01

I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.

Type Lines Year
METRO L5 1959
METRO L5 1959
METRO L5 1959
METRO L9NL10N 2009
METRO L9S 2016
METRO L10S 2018
METRO L10N 2010
METRO L4 1926
METRO L1 1926
METRO L1 1926

CodePudding user response:

A data.table oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)" to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.

library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]

#     Type Year Lines
# 1: METRO 1959    L5
# 2: METRO 1959    L5
# 3: METRO 1959    L5
# 4: METRO 2009   L9N
# 5: METRO 2009  L10N
# 6: METRO 2016   L9S
# 7: METRO 2018  L10S
# 8: METRO 2010  L10N
# 9: METRO 1926    L4
#10: METRO 1926    L1
#11: METRO 1926    L1

edit:
Use regex "(?<=[A-Z])(?=L)" for splitting after any capital letter [A-Z], followed by the capital letter L.

CodePudding user response:

Here is a tidy and efficient way for data.frames:

library(dplyr)

df %>% 
  mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>% 
  tidyr::unnest(Lines)

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <dbl>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

Created on 2022-04-01 by the reprex package (v2.0.1)

It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>.

Data

df <- tibble::tribble(
  ~Type, ~Lines, ~Year,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L9NL10N",  2009,
  "METRO",  "L9S",  2016,
  "METRO",  "L10S", 2018,
  "METRO",  "L10N", 2010,
  "METRO",  "L4", 1926,
  "METRO",  "L1", 1926,
  "METRO",  "L1",   1926
)

CodePudding user response:

Another possible solution:

library(tidyverse)

df %>% 
  separate_rows(Lines, sep="(?<=[N|S])(?=L)")

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <int>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926
  • Related