Home > OS >  R, dplyr split dataframe by string in columns
R, dplyr split dataframe by string in columns

Time:02-08

I've seen examples here that show how to split a dataframe using column index ranges but how do I split my dataframe with dplyr using strings found in the column? I've purposely created my dataframes so the columns have certain strings in them for future splitting.

Example data:

Site   A_Argas   A_Arachnicea   A_Brus   B_Burkoll   B_Brielle  B_Bact
 1       10        0              0         0           0         0
 2       0         0              0         10          22        123
 3       1         2              3         88          12        546

I want to split this dataframe up based on strings such as "A_" or "B_" and assign them to new dataframes.

For example the output would be:

 dataframeA
 Site   A_Argas   A_Arachnicea   A_Brus 
 1       10        0              0        
 2       0         0              0        
 3       1         2              3       

 dataframeB
 Site   B_Burkoll   B_Brielle  B_Bact
  1       0           0         0
  2      10          22        123
  3      88          12        546

Because this data is not in long format, I can't seem to figure how to change my old code that I used to split the longform dataframes (for a different analysis).

dataframeA <- data %>% filter("GroupID" == "Arachnids") # where "A_" in column headers signify arachnid species
dataframeB <- data %>% filter("GroupID" == "Bacteria") # where all "B_" in the column headers are bacterial species

CodePudding user response:

One base R option might be to grep for the columns starting with A_ and B_, then to subset your original data frame.

a_names <- grep("^A_", names(data), value=TRUE)
dataframeA <- data[ , c("Site", a_names)]
b_names <- grep("^B_", names(data), value=TRUE)
dataframeB <- data[ , c("Site", b_names)]

CodePudding user response:

Another possible solution, based on group_split, pivot_longer, pivot_wider and map:

library(tidyverse)

df <- structure(list(Site = c(1, 2, 3), A_Argas = c(10, 0, 1), A_Arachnicea = c(0, 
0, 2), A_Brus = c(0, 0, 3), B_Burkoll = c(0, 10, 88), B_Brielle = c(0, 
22, 12), B_Bact = c(0, 123, 546)), row.names = c(NA, -3L), class = "data.frame")

df %>%
  pivot_longer(-1) %>% 
  group_split(str_extract(name, "^[A-Z]\\_")) %>% 
  map(~ pivot_wider(.x, id_cols = Site))

#> [[1]]
#> # A tibble: 3 × 4
#>    Site A_Argas A_Arachnicea A_Brus
#>   <dbl>   <dbl>        <dbl>  <dbl>
#> 1     1      10            0      0
#> 2     2       0            0      0
#> 3     3       1            2      3
#> 
#> [[2]]
#> # A tibble: 3 × 4
#>    Site B_Burkoll B_Brielle B_Bact
#>   <dbl>     <dbl>     <dbl>  <dbl>
#> 1     1         0         0      0
#> 2     2        10        22    123
#> 3     3        88        12    546

CodePudding user response:

We may use split.default for this in base R, remove the first column ('Site'), split the data into a list of dataframe by removing the substring from the _ (using trimws), cbind the first column in the list of data.frames and change the list name

lst1 <- split.default(data[-1], trimws(names(data)[-1], whitespace = "_.*"))

lst2 <- Map(cbind, Site = data['Site'], lst1)
names(lst2) <- paste0("dataframe", names(lst1))

-output

> lst2
$dataframeA
  Site A_Argas A_Arachnicea A_Brus
1    1      10            0      0
2    2       0            0      0
3    3       1            2      3

$dataframeB
  Site B_Burkoll B_Brielle B_Bact
1    1         0         0      0
2    2        10        22    123
3    3        88        12    546

It may be better to keep it in a list and access the elements as

lst2[["dataframeA"]]
lst2$dataframeA

instead of creating multiple objects (although can be created with list2env

list2env(lst2, .GlobalEnv)

With tidyverse, we could loop over the substring of the column names and select the relevant columns with map

library(purrr)
library(dplyr)
library(stringr)
str_remove(names(data)[-1] , "_.*") %>%
   unique %>% 
   map(~ data %>% 
   select(Site, starts_with(.x)))

-output

[[1]]
  Site A_Argas A_Arachnicea A_Brus
1    1      10            0      0
2    2       0            0      0
3    3       1            2      3

[[2]]
  Site B_Burkoll B_Brielle B_Bact
1    1         0         0      0
2    2        10        22    123
3    3        88        12    546

data

data <- structure(list(Site = 1:3, A_Argas = c(10L, 0L, 1L), A_Arachnicea = c(0L, 
0L, 2L), A_Brus = c(0L, 0L, 3L), B_Burkoll = c(0L, 10L, 88L), 
    B_Brielle = c(0L, 22L, 12L), B_Bact = c(0L, 123L, 546L)), 
class = "data.frame", row.names = c(NA, 
-3L))
  •  Tags:  
  • Related