Separating a column with multiple different entries with tidyr-CodePudding

I am trying to split up one column in a data frame that shows the period active(s) for several artists/ bands into two columns (start_of_career, end_of_career). The variable class is character. I tried to use tidyrs separate function for it and when I run it, I see that it is split in the console but not in the data frame itself, so I assume that it doesn't work properly.

Please see here a made up example of the data I want to split:

Column A	Column B
Artist A	1995-present
Artist B	1995-1997, 2008, 2010-present

As you can see, some rows will consists only of a start and end date, while others have several dates. All I actually need is the first number and the last, e.g. for Artist B I need only start_of_career 1995 and end_of_career "present". But I am somehow not able to solve this issue.

The code I used was:

library(tidyr)
df %>% separate(col = period_active, into = c('start_of_career', 'end_of_career'), sep = '-')

I also tried other separators(",", " "), but it didn't work either.

I also tried:

df$start_of_career = strsplit(df$period_active, split = '-')

But this didn't work as well.

CodePudding user response：

Using df shown reproducibly in the Note at the end remove everything except first and last parts of Column B and then separate what is left.

library(dplyr)
library(tidyr)

dd %>%
  mutate(`Column B` = sub("-.*-", "-", `Column B`)) %>%
  separate(`Column B`, c("start", "end"))
##   Column A start     end
## 1 Artist A  1995 present
## 2 Artist B  1995 present

Note

df <- 
structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA, 
-2L))

CodePudding user response：

We could use separate_rows and then filter for first and last row of group:

library(tidyr)
library(dplyr)

df %>% 
  separate_rows(Column.B) %>% 
  group_by(Column.A) %>% 
  filter(row_number()==1 | row_number()==n()) %>% 
  mutate(Colum.C = c("start", "end"))

  Column.A Column.B Colum.C
  <chr>    <chr>    <chr>  
1 Artist A 1995     start  
2 Artist A present  end    
3 Artist B 1995     start  
4 Artist B present  end

data:

structure(list(Column.A = c("Artist A", "Artist B"), Column.B = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA, 
-2L))

CodePudding user response：

Using base R

df <- cbind(df[1], read.table(text = sub("-[0-9, ] ", "", df$`Column B`),
    header = FALSE, col.names = c("start", "end"), sep = "-"))

-output

> df
  Column A start     end
1 Artist A  1995 present
2 Artist B  1995 present

We could do this with separate as well

library(tidyr)
separate(df, `Column B`, into = c("start", "end"), sep = "-[^A-Za-z]*")
  Column A start     end
1 Artist A  1995 present
2 Artist B  1995 present

data

df <- structure(list(`Column A` = c("Artist A", "Artist B"), 
`Column B` = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame",
 row.names = c(NA, 
-2L))

CodePudding user response：

Using strsplit and then subsequently pick the first and the last entry.

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(splitrow = strsplit(`Column B`, "-"), 
    start_of_career = splitrow[1], 
    end_of_career = splitrow[length(splitrow)], 
    splitrow = NULL) %>% 
  ungroup()
# A tibble: 2 × 4
  `Column A` `Column B`                    start_of_career end_of_career
  <chr>      <chr>                         <chr>           <chr>
1 Artist A   1995-present                  1995            present
2 Artist B   1995-1997, 2008, 2010-present 1995            present

Data

df <- structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))

CodePudding user response：

Another option: use strsplit, and return the list of start and end values

f <- \(v) {
  v = strsplit(v, "-|,| ")[[1]]
  list(start = v[1],end = v[length(v)])
}

df %>% 
  mutate(df, `Column B` = lapply(`Column B`,f)) %>%
  unnest_wider(`Column B`)

Output:

# A tibble: 2 × 3
  `Column A` start end    
  <chr>      <chr> <chr>  
1 Artist A   1995  present
2 Artist B   1995  present

CodePudding user response：

Below code extract the first word before the dash and last word after.

for(i in 1:length(df))
{
df$start[i] <-sub("-.*", "", df$`Column B`[i])
df$end[i] <-sub("^. -", "", df$`Column B`[i])

}