Home > Software engineering >  Bin multiple columns in a data.table with respect to values in another column
Bin multiple columns in a data.table with respect to values in another column

Time:03-01

I have a data.table that has N=3 columns. For simplicity, let's assume the first column is the expected value, and the subsequent columns are observation of that particular value in each samples.

Example of the data.table:

experiment obs_s1 obs_s2
1 1 1
2 2 2
3 3 3
4 2 2
5 2 4
6 4 4
7 5 4
8 4 4
9 3 3
10 4 4

The challenge is to bin the experiments into arbitrary intervals and apply a particular function to the observation from samples. Let's assume, finding the maximum value and sum of all observations in the custom defined range.

Since for making a new data.table the code looks complex, first I used melt.data.table:

experiment sample count
1   obs_s1  1
2   obs_s1  2
3   obs_s1  3
4   obs_s1  2
5   obs_s1  2
6   obs_s1  4
7   obs_s1  5
8   obs_s1  4
9   obs_s1  3
10  obs_s1  4
1   obs_s2  1
2   obs_s2  2
3   obs_s2  3
4   obs_s2  2
5   obs_s2  4
6   obs_s2  4
7   obs_s2  4
8   obs_s2  4
9   obs_s2  3
10  obs_s2  4

And tried to use cut, but that only applies to one column that is count but I cannot really apply a function with lapplyor similar to the output. Example:

experiment = c(0,4,8,10)
labels = c('bin_1', 'bin_2', 'bin_3')
my_dt_melt <- my_dt_melt[, bins":= cut(insert_size, breaks = insert_size_intervals, labels = labels)]

What I'm doing now is just using a for loop and extracting values and creating a new data.table, but it is not utilizing the power of data.table

The output I'd like to achieve from this melted data.table is the following:

  1. Example of output for max values:
range sample value
1-4 obs_s1 3
1-4 obs_s2 3
5-8 obs_s1 5
5-8 obs_s2 4
9-10 obs_s1 4
9-10 obs_s2 4
  1. Example output for sum of values:
range sample value
1-4 obs_s1 8
1-4 obs_s2 8
5-8 obs_s1 15
5-8 obs_s2 16
9-10 obs_s1 7
9-10 obs_s2 7

So the question is: How to bin a column in a melted data.table w.r.t to first column and apply a function on it?

CodePudding user response:

You can put functions/expressions in the by argument:

my_dt_melt[, list(maxcount = max(count), sumcount = sum(count)),
   by = .(
     range = cut(
       experiment,
       c(0,4,8,10),
       labels = c('bin_1', 'bin_2', 'bin_3')),
     sample
)]
#      range sample maxcount sumcount
# 1:   bin_1 obs_s1        3        8
# 2:   bin_2 obs_s1        5       15
# 3:   bin_3 obs_s1        4        7
# 4:   bin_1 obs_s2        3        8
# 5:   bin_2 obs_s2        4       16
# 6:   bin_3 obs_s2        4        7
  • Related