Home > Enterprise >  Comparing and filtering across multiple columns using string matching in R
Comparing and filtering across multiple columns using string matching in R

Time:06-02

I have a dataframe like so:

    TS    Device1.max   Device2.max   Device3.max       Device4.max
18:02:44     FALSE        FALSE          TRUE               FALSE
18:02:45     TRUE         TRUE          FALSE               FALSE
18:02:46     FALSE        FALSE         FALSE               TRUE
18:02:47     FALSE        FALSE         FALSE               FALSE
18:02:48     FALSE        FALSE         FALSE               FALSE
18:02:49     FALSE        FALSE         FALSE               FALSE
18:02:50     FALSE        FALSE         FALSE               FALSE
18:02:51     FALSE        FALSE         FALSE               FALSE
18:02:52     FALSE        FALSE         FALSE               TRUE
18:02:53     FALSE        TRUE          FALSE               FALSE
18:02:54     FALSE        FALSE         FALSE               FALSE

To get the true false columns I used the following code:

df$Device1.max = ifelse(df$Device1 == max(df$Device1), 'true','false')
df$Device2.max = ifelse(df$Device2 == max(df$Device2), 'true','false')
df$Device3.max = ifelse(df$Device3 == max(df$Device3), 'true','false')
df$Device4.max = ifelse(df$Device4 == max(df$Device4), 'true','false')

For simplicity I am only showing 4 Device columns. I have about a hundred device columns where I would like to do the comparison. It wont be feasible to specify the hundred columns in a hundred ifelse statements How do I compare using regex or specifying generic column name assuming all the device columns of interest will have some sort of name starting like device?

I then want to filter to or find the row where maximum Device.max columns satisfy the condition where it is TRUE within /-1 row of it. Algorithmically, I would create an index column and filter to a dataframe where only the TRUE values are present. Then I would check how many columns have indices within 1 row of each other. In the above case rows 1,2 & 3 have 4 columns satisfying the true condition, whereas rows 9 & 10 have only 2 columns satisfying the condition. Therefore my expected output would be:

     TS      Device1.max    Device2.max   Device3.max         Device4.max
    18:02:44     FALSE        FALSE          TRUE               FALSE
    18:02:45     TRUE         TRUE          FALSE               FALSE
    18:02:46     FALSE        FALSE         FALSE               TRUE

However this method seems very iterative and inefficient. Is there a better way to do it leveraging dataframe functions in R?

CodePudding user response:

This code should answer the first TRUE/FALSE question

r <- c();
colum <- c();
for (colu in 2:ncol(example_table)){
  example_table[ ,colu] <- example_table[ ,colu]==max(example_table[ ,colu]) # returns True/False
  val <- which(example_table[ ,colu]==T) # searching for row indexes
  r <- append(r,val) # append row indexes
  colum <- append(colum,rep(colu,length(val))) # since one column can contain more than one True, repeat that column index and append it
}
true_values <- cbind(r,colum) # just a matrix-like output

out:

> example_table
     V1 V2 V3 V4
1 18:02  5  8  1
2 14:05  7  1  7
3 19:27  7  6  1

# After for:

> example_table
     V1    V2    V3    V4
1 18:02 FALSE  TRUE FALSE
2 14:05  TRUE FALSE  TRUE
3 19:27  TRUE FALSE FALSE

> true_values
     r colum
[1,] 2     2
[2,] 3     2
[3,] 1     3
[4,] 2     4

where r is the row index and colum is the column index which contains true values. Note that example_table[ ,colu]==max(example_table[ ,colu]) returns a TRUE/FALSE value, and note that colum <- append(colum,rep(colu,length(val))) is needed to avoid future dimensional problems too.

For the second question, you have now row indexes which contains TRUE values. You can then implement a code that when rows above and under the selected row contains a TRUE value (any() function should be a good one), select that row. Then subset() original data frame with this row indexes.

  • Related