Pick selected elements from long strings in a data.table cell without using a loop-CodePudding

I have a large dataset containing microbial samples. For each species, there's a column containing all samples (by ID) that this species has been found in. The next column then shows how many of the species have been found in the respective samples. Here's a short example datatable:

library(data.table)


data <- data.table(species = c("A","B","C","D"),
                   samples = c("153;16770;6630;34507;16307","16608;6630;5506;26679","23285;35048;37761","21903;24603"),
                   abundances = c("1;101;2;77;1","8;1700;1;1","270;1;3","99;1"))

# split the strings of data into lists
data[,s := strsplit(samples,split=";",fixed=TRUE)]
data[,a := strsplit(abundances,split=";",fixed=TRUE)]

data
#   species                    samples     abundances                          s              a
#1:       A 153;16770;6630;34507;16307   1;101;2;77;1 153,16770,6630,34507,16307   1,101,2,77,1
#2:       B      16608;6630;5506;26679     8;1700;1;1      16608,6630,5506,26679     8,1700,1,1
#3:       C          23285;35048;37761        270;1;3          23285,35048,37761        270,1,3
#4:       D                21903;24603           99;1                21903,24603           99,1

I now need to fill a table that shows me how many individuals of a species show up in a given sample:

# we are only looking at these samples:
selected_samples <- data.table(sample = c("153","16770","6448","37761","5506","7852"))

# create output table...
df <- data.table(matrix(0, ncol = nrow(selected_samples) 1, nrow = nrow(data)))
#rename samples (put letter "s" before sample ID)
colnames(df) <- c("species",data[,selected_samples[,sample]])
df[,species:=data[,species]]

df
#   species 153 16770 6448 37761 5506 7852
#1:       A   0     0    0     0    0    0
#2:       B   0     0    0     0    0    0
#3:       C   0     0    0     0    0    0
#4:       D   0     0    0     0    0    0

So I basically want to go through my data, and whenever a species (i) is found in a sample (j) which is one of the selected samples in df I would like to fill the respective cell df[i,j] with the abundance shown in the data column 'abundances'.

I know I could easily loop through my data and fill my output table, but I would like to use data.table and loops or apply() just seem to make data.table very slow. Do you know of any other way to solve this?

CodePudding user response：

Given there are always as many abundances as samples, here are two ways:

data.table

data[, .(s = unlist(s),
         a = unlist(a)),
     by = species]
#>     species      s      a
#>      <char> <char> <char>
#>  1:       A    153      1
#>  2:       A  16770    101
#>  3:       A   6630      2
#>  4:       A  34507     77
#>  5:       A  16307      1
#>  6:       B  16608      8
#>  7:       B   6630   1700
#>  8:       B   5506      1
#>  9:       B  26679      1
#> 10:       C  23285    270
#> 11:       C  35048      1
#> 12:       C  37761      3
#> 13:       D  21903     99
#> 14:       D  24603      1

tidyr

{tidyr}s unnest function is specifically designed for this kind of task

tidyr::unnest(data, cols = c("s", "a"))
#> # A tibble: 14 × 5
#>    species samples                    abundances   s     a    
#>    <chr>   <chr>                      <chr>        <chr> <chr>
#>  1 A       153;16770;6630;34507;16307 1;101;2;77;1 153   1    
#>  2 A       153;16770;6630;34507;16307 1;101;2;77;1 16770 101  
#>  3 A       153;16770;6630;34507;16307 1;101;2;77;1 6630  2    
#>  4 A       153;16770;6630;34507;16307 1;101;2;77;1 34507 77   
#>  5 A       153;16770;6630;34507;16307 1;101;2;77;1 16307 1    
#>  6 B       16608;6630;5506;26679      8;1700;1;1   16608 8    
#>  7 B       16608;6630;5506;26679      8;1700;1;1   6630  1700 
#>  8 B       16608;6630;5506;26679      8;1700;1;1   5506  1    
#>  9 B       16608;6630;5506;26679      8;1700;1;1   26679 1    
#> 10 C       23285;35048;37761          270;1;3      23285 270  
#> 11 C       23285;35048;37761          270;1;3      35048 1    
#> 12 C       23285;35048;37761          270;1;3      37761 3    
#> 13 D       21903;24603                99;1         21903 99   
#> 14 D       21903;24603                99;1         24603 1

Final Step

Either way, afterwards you can just reshape the data:

dcast.data.table(dt2, formula = species ~ s, fill = 0, value.var = "a")
#> Key: <species>
#>    species    153  16307  16608  16770  21903  23285  24603  26679  34507  35048  37761   5506   6630
#>     <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
#> 1:       A      1      1      0    101      0      0      0      0     77      0      0      0      2
#> 2:       B      0      0      8      0      0      0      0      1      0      0      0      1   1700
#> 3:       C      0      0      0      0      0    270      0      0      0      1      3      0      0
#> 4:       D      0      0      0      0     99      0      1      0      0      0      0      0      0