Split string without delimeters-CodePudding

I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.

Type	Lines	Year
METRO	L5	1959
METRO	L5	1959
METRO	L5	1959
METRO	L9NL10N	2009
METRO	L9S	2016
METRO	L10S	2018
METRO	L10N	2010
METRO	L4	1926
METRO	L1	1926
METRO	L1	1926

CodePudding user response：

A data.table oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)" to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.

library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]

#     Type Year Lines
# 1: METRO 1959    L5
# 2: METRO 1959    L5
# 3: METRO 1959    L5
# 4: METRO 2009   L9N
# 5: METRO 2009  L10N
# 6: METRO 2016   L9S
# 7: METRO 2018  L10S
# 8: METRO 2010  L10N
# 9: METRO 1926    L4
#10: METRO 1926    L1
#11: METRO 1926    L1

edit:
Use regex "(?<=[A-Z])(?=L)" for splitting after any capital letter [A-Z], followed by the capital letter L.

CodePudding user response：

Here is a tidy and efficient way for data.frames:

library(dplyr)

df %>% 
  mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>% 
  tidyr::unnest(Lines)

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <dbl>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

^{Created on 2022-04-01 by the reprex package (v2.0.1)}

It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>.

Data

df <- tibble::tribble(
  ~Type, ~Lines, ~Year,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L9NL10N",  2009,
  "METRO",  "L9S",  2016,
  "METRO",  "L10S", 2018,
  "METRO",  "L10N", 2010,
  "METRO",  "L4", 1926,
  "METRO",  "L1", 1926,
  "METRO",  "L1",   1926
)

CodePudding user response：

Another possible solution:

library(tidyverse)

df %>% 
  separate_rows(Lines, sep="(?<=[N|S])(?=L)")

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <int>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926