Extract all the unique values for a given substring in the column names-CodePudding

I have a dataframe where the colunm names have a structure as follows:

Barcelona.Standard.2012.True
Berlin.One.2013.True
London.One.2014.True
Barcelona.Standard.2015.True
Berlin.One.2016.True

As you can see, each columns specifies the City, Type of bank account, Year it was open and if it's Active or not.

I want to extract into a list, all the different possibilities for each of the categories. For example, for the first category, which is the city, we would get:

Barcelona Berlin London

What I though of doing is splitting by the symbol . and then getting all the unique values for a given position for all the columns.

I could do this with a loop, but I would like to do it with a sapply, if that's possible.

Something like:

lapply(strsplit(colnames(dat), split = "\\.")[[]][1])

Where [[]] would be all the columns.

This is just a toy example, the real dataset has thousands of columns.

CodePudding user response：

With sapply:

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = F))

Or use data.table::transpose instead of transpose to make it easier:

sapply(data.table::transpose(strsplit(col, "\\.")), unique)

Finally, use setNames to set the names:

sapply(transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = F)) |>
  setNames(c("City", "Type", "Year", "Active"))

output:

$City
[1] "Barcelona" "Berlin"    "London"   

$Type
[1] "Standard" "One"     

$Year
[1] "2012" "2013" "2014" "2015" "2016"

$Active
[1] "True"

data

col <- c("Barcelona.Standard.2012.True",
  "Berlin.One.2013.True",
  "London.One.2014.True",
  "Barcelona.Standard.2015.True",
  "Berlin.One.2016.True")

CodePudding user response：

Here is a way with tidyr::separate and the new base R >= 4.1 pipe operator and lambda.

tidyr::separate(
  dat, x, 
  into = c("City", "Type", "Year", "Active"),
  sep = "[^[:alnum:]]"
) |>
  as.list() |>
  (\(x) Map(unique, x))()
#> $City
#> [1] "Barcelona" "Berlin"    "London"   
#> 
#> $Type
#> [1] "Standard" "One"     
#> 
#> $Year
#> [1] "2012" "2013" "2014" "2015" "2016"
#> 
#> $Active
#> [1] "True"

^{Created on 2022-02-14 by the reprex package (v2.0.1)}

Edit

Simpler, with lapply instead of Map.

tidyr::separate(
  dat, x, 
  into = c("City", "Type", "Year", "Active"),
  sep = "[^[:alnum:]]"
) |>
  (\(x) lapply(x, unique))()
#> $City
#> [1] "Barcelona" "Berlin"    "London"   
#> 
#> $Type
#> [1] "Standard" "One"     
#> 
#> $Year
#> [1] "2012" "2013" "2014" "2015" "2016"
#> 
#> $Active
#> [1] "True"

^{Created on 2022-02-14 by the reprex package (v2.0.1)}

Data

x <- scan(text = "
Barcelona.Standard.2012.True
Berlin.One.2013.True
London.One|2014.True
Barcelona.Standard.2015.True
Berlin.One.2016.True", what = character())

dat <- data.frame(x)

^{Created on 2022-02-14 by the reprex package (v2.0.1)}