I have a large dataset containing microbial samples. For each species, there's a column containing all samples (by ID) that this species has been found in. The next column then shows how many of the species have been found in the respective samples. Here's a short example datatable:
library(data.table)
data <- data.table(species = c("A","B","C","D"),
samples = c("153;16770;6630;34507;16307","16608;6630;5506;26679","23285;35048;37761","21903;24603"),
abundances = c("1;101;2;77;1","8;1700;1;1","270;1;3","99;1"))
# split the strings of data into lists
data[,s := strsplit(samples,split=";",fixed=TRUE)]
data[,a := strsplit(abundances,split=";",fixed=TRUE)]
data
# species samples abundances s a
#1: A 153;16770;6630;34507;16307 1;101;2;77;1 153,16770,6630,34507,16307 1,101,2,77,1
#2: B 16608;6630;5506;26679 8;1700;1;1 16608,6630,5506,26679 8,1700,1,1
#3: C 23285;35048;37761 270;1;3 23285,35048,37761 270,1,3
#4: D 21903;24603 99;1 21903,24603 99,1
I now need to fill a table that shows me how many individuals of a species show up in a given sample:
# we are only looking at these samples:
selected_samples <- data.table(sample = c("153","16770","6448","37761","5506","7852"))
# create output table...
df <- data.table(matrix(0, ncol = nrow(selected_samples) 1, nrow = nrow(data)))
#rename samples (put letter "s" before sample ID)
colnames(df) <- c("species",data[,selected_samples[,sample]])
df[,species:=data[,species]]
df
# species 153 16770 6448 37761 5506 7852
#1: A 0 0 0 0 0 0
#2: B 0 0 0 0 0 0
#3: C 0 0 0 0 0 0
#4: D 0 0 0 0 0 0
So I basically want to go through my data, and whenever a species
(i) is found in a sample
(j) which is one of the selected samples in df
I would like to fill the respective cell df[i,j]
with the abundance
shown in the data column 'abundances'.
I know I could easily loop through my data and fill my output table, but I would like to use data.table and loops or apply() just seem to make data.table very slow. Do you know of any other way to solve this?
CodePudding user response:
Given there are always as many abundances as samples, here are two ways:
data.table
data[, .(s = unlist(s),
a = unlist(a)),
by = species]
#> species s a
#> <char> <char> <char>
#> 1: A 153 1
#> 2: A 16770 101
#> 3: A 6630 2
#> 4: A 34507 77
#> 5: A 16307 1
#> 6: B 16608 8
#> 7: B 6630 1700
#> 8: B 5506 1
#> 9: B 26679 1
#> 10: C 23285 270
#> 11: C 35048 1
#> 12: C 37761 3
#> 13: D 21903 99
#> 14: D 24603 1
tidyr
{tidyr}s unnest function is specifically designed for this kind of task
tidyr::unnest(data, cols = c("s", "a"))
#> # A tibble: 14 × 5
#> species samples abundances s a
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A 153;16770;6630;34507;16307 1;101;2;77;1 153 1
#> 2 A 153;16770;6630;34507;16307 1;101;2;77;1 16770 101
#> 3 A 153;16770;6630;34507;16307 1;101;2;77;1 6630 2
#> 4 A 153;16770;6630;34507;16307 1;101;2;77;1 34507 77
#> 5 A 153;16770;6630;34507;16307 1;101;2;77;1 16307 1
#> 6 B 16608;6630;5506;26679 8;1700;1;1 16608 8
#> 7 B 16608;6630;5506;26679 8;1700;1;1 6630 1700
#> 8 B 16608;6630;5506;26679 8;1700;1;1 5506 1
#> 9 B 16608;6630;5506;26679 8;1700;1;1 26679 1
#> 10 C 23285;35048;37761 270;1;3 23285 270
#> 11 C 23285;35048;37761 270;1;3 35048 1
#> 12 C 23285;35048;37761 270;1;3 37761 3
#> 13 D 21903;24603 99;1 21903 99
#> 14 D 21903;24603 99;1 24603 1
Final Step
Either way, afterwards you can just reshape the data:
dcast.data.table(dt2, formula = species ~ s, fill = 0, value.var = "a")
#> Key: <species>
#> species 153 16307 16608 16770 21903 23285 24603 26679 34507 35048 37761 5506 6630
#> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
#> 1: A 1 1 0 101 0 0 0 0 77 0 0 0 2
#> 2: B 0 0 8 0 0 0 0 1 0 0 0 1 1700
#> 3: C 0 0 0 0 0 270 0 0 0 1 3 0 0
#> 4: D 0 0 0 0 99 0 1 0 0 0 0 0 0