I have a data.table
that has N=3 columns. For simplicity, let's assume the first column is the expected value, and the subsequent columns are observation of that particular value in each samples.
Example of the data.table
:
experiment obs_s1 obs_s2
1 1 1
2 2 2
3 3 3
4 2 2
5 2 4
6 4 4
7 5 4
8 4 4
9 3 3
10 4 4
The challenge is to bin the experiments into arbitrary intervals and apply a particular function to the observation from samples. Let's assume, finding the maximum value and sum of all observations in the custom defined range.
Since for making a new data.table
the code looks complex, first I used melt.data.table
:
experiment sample count
1 obs_s1 1
2 obs_s1 2
3 obs_s1 3
4 obs_s1 2
5 obs_s1 2
6 obs_s1 4
7 obs_s1 5
8 obs_s1 4
9 obs_s1 3
10 obs_s1 4
1 obs_s2 1
2 obs_s2 2
3 obs_s2 3
4 obs_s2 2
5 obs_s2 4
6 obs_s2 4
7 obs_s2 4
8 obs_s2 4
9 obs_s2 3
10 obs_s2 4
And tried to use cut
, but that only applies to one column that is count
but I cannot really apply a function with lapply
or similar to the output. Example:
experiment = c(0,4,8,10)
labels = c('bin_1', 'bin_2', 'bin_3')
my_dt_melt <- my_dt_melt[, bins":= cut(insert_size, breaks = insert_size_intervals, labels = labels)]
What I'm doing now is just using a for loop and extracting values and creating a new data.table
, but it is not utilizing the power of data.table
The output I'd like to achieve from this melted data.table
is the following:
- Example of output for max values:
range sample value
1-4 obs_s1 3
1-4 obs_s2 3
5-8 obs_s1 5
5-8 obs_s2 4
9-10 obs_s1 4
9-10 obs_s2 4
- Example output for sum of values:
range sample value
1-4 obs_s1 8
1-4 obs_s2 8
5-8 obs_s1 15
5-8 obs_s2 16
9-10 obs_s1 7
9-10 obs_s2 7
So the question is: How to bin a column in a melted data.table
w.r.t to first column and apply a function on it?
CodePudding user response:
You can put functions/expressions in the by
argument:
my_dt_melt[, list(maxcount = max(count), sumcount = sum(count)),
by = .(
range = cut(
experiment,
c(0,4,8,10),
labels = c('bin_1', 'bin_2', 'bin_3')),
sample
)]
# range sample maxcount sumcount
# 1: bin_1 obs_s1 3 8
# 2: bin_2 obs_s1 5 15
# 3: bin_3 obs_s1 4 7
# 4: bin_1 obs_s2 3 8
# 5: bin_2 obs_s2 4 16
# 6: bin_3 obs_s2 4 7