I need to copy numbers between two text strings in column, "transpose" it to columns next to the first text string and repeat through the rest of the column. Dataset has hundreds of rows.
The "alignment" (like putting NAs for correcting positions of numbers) is not necessary, transposing would be more than enough.
input:
df1 <- structure(list(column1 = c("STOUT", "18", "9341", "4", "0,2005",
"STOUT", "1", "9341", "25", "0,2004", "STIN", "7", "9341", "0,2003",
"OFF", "7", "L(1)", "9342", "0,2005")), class = "data.frame", row.names = c(NA,
-19L))
> print(df)
column1
1 STOUT
2 18
3 9341
4 4
5 0,2005
6 STOUT
7 1
8 9341
9 25
10 0,2004
11 STIN
12 7
13 9341
14 0,2003
15 OFF
16 7
17 L(1)
18 9342
19 0,2005
desired output:
df2 <- structure(list(column1 = c("STOUT", "STOUT", "STIN", "OFF", "L(1)"
), column2 = c(18L, 1L, 7L, 7L, NA), column3 = c(9341L, 9341L,
9341L, NA, 9342L), column4 = c(4L, 25L, NA, NA, NA), column5 = c(0.2005,
0.2004, 0.2003, NA, 0.2005)), class = "data.frame", row.names = c(NA,
-5L))
> print(df2)
column1 column2 column3 column4 column5
1 STOUT 18 9341 4 0.2005
2 STOUT 1 9341 25 0.2004
3 STIN 7 9341 NA 0.2003
4 OFF 7 NA NA NA
5 L(1) NA 9342 NA 0.2005
I was thinking along Extracting a string between other two strings in R
But did not make much progress :-/
Thanks in advance.
CodePudding user response:
Split on a regex-based logical:
(I defined the break points based on the presence of capital letters (i.e. [A-Z]
); you may want to modify the pattern based on your expected break points in df1$column1
.)
a <- split(df1$column1, cumsum(grepl('[A-Z]', df1$column1)))
a
$`1`
[1] "STOUT" "18" "9341" "4" "0,2005"
$`2`
[1] "STOUT" "1" "9341" "25" "0,2004"
$`3`
[1] "STIN" "7" "9341" "0,2003"
$`4`
[1] "OFF" "7"
$`5`
[1] "L(1)" "9342" "0,2005"
Then rbind()
and fill with NA
:
(plyr::rbind.fill()
expects a dataframe, so I'm using lapply()
to call as.data.frame()
to each list element.)
library(plyr)
plyr::rbind.fill(lapply(a,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
V1 V2 V3 V4 V5
1 STOUT 18 9341 4 0,2005
2 STOUT 1 9341 25 0,2004
3 STIN 7 9341 0,2003 <NA>
4 OFF 7 <NA> <NA> <NA>
5 L(1) 9342 0,2005 <NA> <NA>
CodePudding user response:
Here is a base R option using read.csv
by
grepl
to make it
df2 <- with(
df1,
read.csv(
text = paste0(
by(column1, cumsum(grepl("^\\D", column1)), toString),
collapse = "\n"
),
header = FALSE
)
)
which gives
> df2
V1 V2 V3 V4 V5 V6
1 STOUT 18 9341 4 0 2005
2 STOUT 1 9341 25 0 2004
3 STIN 7 9341 0 2003 NA
4 OFF 7 NA NA NA NA
5 L(1) 9342 0 2005 NA NA
> str(df2)
'data.frame': 5 obs. of 6 variables:
$ V1: chr "STOUT" "STOUT" "STIN" "OFF" ...
$ V2: int 18 1 7 7 9342
$ V3: int 9341 9341 9341 NA 0
$ V4: int 4 25 0 NA 2005
$ V5: int 0 0 2003 NA NA
$ V6: int 2005 2004 NA NA NA
CodePudding user response:
Another possible solution, based on tidyverse
:
library(tidyverse)
df1 %>%
group_by(aux = cumsum(str_detect(column1, "[^\\d|,]"))) %>%
mutate(x = rep(first(column1), n())) %>%
filter(!str_detect(column1, "[^\\d|,]")) %>%
mutate(name = paste0("column", 2:(n() 1))) %>%
ungroup %>%
pivot_wider(c(aux, x), values_from = column1) %>%
select(-aux, column1 = x)
#> # A tibble: 5 × 5
#> column1 column2 column3 column4 column5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 STOUT 18 9341 4 0,2005
#> 2 STOUT 1 9341 25 0,2004
#> 3 STIN 7 9341 0,2003 <NA>
#> 4 OFF 7 <NA> <NA> <NA>
#> 5 L(1) 9342 0,2005 <NA> <NA>