Home > OS >  calculate mean (or uniquely mark) groups that match critearia
calculate mean (or uniquely mark) groups that match critearia

Time:10-26

This goes back to a question I've asked some time ago - I'm still struggling in the same area.

I have a data.table with some values ('value'), lower limits ('min_val') and upper limits ('max_val'):

   | value | min_val | max_val |
1: | 94.001 | 94.00 | 94.02 |
2: | 94.002 | 94.00 | 94.03 |
3: | 94.003 | 94.01 | 94.04 |
4: | 95 | 94.98 | 95.02 |
5: | 100.00 | 99.00 | 101.00 |
6: | 100.10 | 99.10 | 101.00 |
7: | 200.00 | 199.00 | 201.00 |
8: | 200.10 | 199.00 | 201.00 |

With your help I have the count:

   | value | min_val | max_val | count | id |
1: | 94.001 | 94.00 | 94.02 |  1       |  1 |
2: | 94.002 | 94.00 | 94.03 |  2       |  2 |
3: | 94.003 | 94.01 | 94.04 |  2       |  2 |
4: | 95     | 94.98 | 95.02 |  1       |  3 |
5: | 100.00 | 99.00 | 101.00 | 2       |  4 |
6: | 100.10 | 99.10 | 101.00 | 2       |  4 |
7: | 200.00 | 199.00 | 201.00 | 2      |  5 |
8: | 200.10 | 199.00 | 201.00 | 2      |  5 |

Now I want to uniquely identify (col id) each "count group" so that I can use the identified later on (by=id). I've tried calculating the mean of 'value' column hoping I'll get unique identifiers but the mean function returns 1/count. Not sure how to proceed - I'm stuck as I don't understand how to "go backwards" while doing the computations (example: for 94.003, I need to compare the next element of column value to it and also the previous element):

dat[, count := mapply(function(mi,ma) mean(mi < value & value < ma), min_val, max_val)]

input:

    library(data.table)
dat <- setDT(structure(list(value = c(94.01, 94.02, 94.03, 95, 100, 100.1, 200, 200.1), min_val = c(94, 94, 94.01, 94.98, 99, 99.1, 199, 199), max_val = c(94.02, 94.03, 94.04, 95.02, 101, 101, 201, 201)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))
dat[, count := mapply(function(mi,ma) sum(mi < value & value < ma), min_val, max_val)]

I'm struggling with this for a few days. The only thing I could think of is that I need to modify the method that does the count so that I do the marking the same way. I did not find a solution. I use this method to generate the count

dat[, count := mapply(function(mi,ma) sum(mi < value & value < ma), min_val, max_val)]

CodePudding user response:

You can use data.table::rleid:

library(data.table)
setDT(df)[, id:=rleid(count)]

Output:

    value min_val max_val count    id
    <num>   <num>   <num> <int> <int>
1: 94.001   94.00   94.02     1     1
2: 94.002   94.00   94.03     2     2
3: 94.003   94.01   94.04     2     2
4: 95.000   94.98   95.02     1     3

Input:

structure(list(value = c(94.001, 94.002, 94.003, 95), min_val = c(94, 
94, 94.01, 94.98), max_val = c(94.02, 94.03, 94.04, 95.02), count = c(1L, 
2L, 2L, 1L)), row.names = c(NA, -4L), class = "data.frame")

CodePudding user response:

This isn't a complete answer because your goals are currently unclear, but I think this will provide you some tools with which to get there.

You have your data (note that the data provided is different from the tables in the question)

dat <- setDT(structure(list(value = c(94.01, 94.02, 94.03, 95, 100, 100.1, 200, 200.1), min_val = c(94, 94, 94.01, 94.98, 99, 99.1, 199, 199), max_val = c(94.02, 94.03, 94.04, 95.02, 101, 101, 201, 201)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))

You have a column of numeric values, and two columns representing the lower and upper boundaries of a range. A count can be done per row either with respect to the value or the range. If the count is per value, then it tells us how many different rows have a range that the value on this row fits into. If the count is per range, then it will tell us how many different rows total have a value that fits into this range.

The easiest way to think about it is to merge your example according to the rules while providing yourself with all of the merged columns to look at and think about. If we do a self-merge with data.table, we have x (on the outside) and i (on the inside). We can label the columns respectively.

dat[dat,
 on = .(value > min_val, value < max_val),
 .(x.value, x.min_val, x.max_val, i.value, i.min_val, i.max_val)]

We can temporarily save this into another variable while we're figuring out what we want.

merged <- dat[dat,
 on = .(value > min_val, value < max_val),
 .(x.value, x.min_val, x.max_val, i.value, i.min_val, i.max_val)]

Then we can, for example, either get the count within the row's range, or the ranges that the value falls into by deciding what columns we want to group together. If we group together the i columns, they were the min_val and max_val comparison, so we can count the number of values within the range. If we group together the x columns, we look at the value in the row for our count.

merged[,
       .(`Count Within This Range` = .N),
       .(i.value, i.min_val, i.max_val)]
merged[,
       .(`Ranges Value Falls Into` = .N),
       .(x.value, x.min_val, x.max_val)]

If I'm reading between the lines, I think you want a group ID that tells you which items end up falling within the same range sequentially, but you're going to have to be more explicit about what precisely should generate the id column. Hopefully being able to look at it like this will help out!

  • Related