I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.
Type | Lines | Year |
---|---|---|
METRO | L5 | 1959 |
METRO | L5 | 1959 |
METRO | L5 | 1959 |
METRO | L9NL10N | 2009 |
METRO | L9S | 2016 |
METRO | L10S | 2018 |
METRO | L10N | 2010 |
METRO | L4 | 1926 |
METRO | L1 | 1926 |
METRO | L1 | 1926 |
CodePudding user response:
A data.table
oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)"
to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.
library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]
# Type Year Lines
# 1: METRO 1959 L5
# 2: METRO 1959 L5
# 3: METRO 1959 L5
# 4: METRO 2009 L9N
# 5: METRO 2009 L10N
# 6: METRO 2016 L9S
# 7: METRO 2018 L10S
# 8: METRO 2010 L10N
# 9: METRO 1926 L4
#10: METRO 1926 L1
#11: METRO 1926 L1
edit:
Use regex "(?<=[A-Z])(?=L)"
for splitting after any capital letter [A-Z]
, followed by the capital letter L
.
CodePudding user response:
Here is a tidy and efficient way for data.frames:
library(dplyr)
df %>%
mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>%
tidyr::unnest(Lines)
#> # A tibble: 11 × 3
#> Type Lines Year
#> <chr> <chr> <dbl>
#> 1 METRO L5 1959
#> 2 METRO L5 1959
#> 3 METRO L5 1959
#> 4 METRO L9N 2009
#> 5 METRO L10N 2009
#> 6 METRO L9S 2016
#> 7 METRO L10S 2018
#> 8 METRO L10N 2010
#> 9 METRO L4 1926
#> 10 METRO L1 1926
#> 11 METRO L1 1926
Created on 2022-04-01 by the reprex package (v2.0.1)
It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>
.
Data
df <- tibble::tribble(
~Type, ~Lines, ~Year,
"METRO", "L5", 1959,
"METRO", "L5", 1959,
"METRO", "L5", 1959,
"METRO", "L9NL10N", 2009,
"METRO", "L9S", 2016,
"METRO", "L10S", 2018,
"METRO", "L10N", 2010,
"METRO", "L4", 1926,
"METRO", "L1", 1926,
"METRO", "L1", 1926
)
CodePudding user response:
Another possible solution:
library(tidyverse)
df %>%
separate_rows(Lines, sep="(?<=[N|S])(?=L)")
#> # A tibble: 11 × 3
#> Type Lines Year
#> <chr> <chr> <int>
#> 1 METRO L5 1959
#> 2 METRO L5 1959
#> 3 METRO L5 1959
#> 4 METRO L9N 2009
#> 5 METRO L10N 2009
#> 6 METRO L9S 2016
#> 7 METRO L10S 2018
#> 8 METRO L10N 2010
#> 9 METRO L4 1926
#> 10 METRO L1 1926
#> 11 METRO L1 1926