Home > Enterprise >  Comparing Two Dataframes to See if One Sits Inside the Range of the Other's columns
Comparing Two Dataframes to See if One Sits Inside the Range of the Other's columns

Time:01-09

I have 2 dataframes as follows:

import pandas as pd
import itertools

df_test = {
    "col1": [1, 5, 6, 5],
    "col2": [4, 5, 6, 7],
    "col3": [5, 9, 10, 11],
    "col4": [11, 10, 10, 4],
}
df_train = {
    "col1": [3, 6, 7, 4, 4],
    "col2": [5, 10, 5, 5, 5],
    "col3": [6, 8, 9, 10, 6],
    "col4": [11, 10, 8, 12, 6],
}
df_train = pd.DataFrame(data=df_train)
df_test = pd.DataFrame(data=df_test)

I have already found the min and max values of df_train and want to see if df_test sits within these values, I want to iterate through each row, continually changing the min and max values as given for each column of the training set. I continually compare each value in the df_test row to see if it fits inside the min and max. If yes, then i will insert True to a list, else I'll write False.

# find minMax
df_train_minMax = df_train.agg([min, max])

# identify no of columns and rows in test data
columns = len(df_test.columns)
rows = len(df_test.index)
df_train_minMax looks like:

     col1  col2  col3  col4
min     3     5     6     6
max     7    10    10    12

I have made 2 lists - list receives True or False values per row of df_test. list2 receives the index location of the False values. And then I've made a loop to iterate through df_test, although its becoming very complicated to read. I am certain there is a better and simpler way of doing this.

list = []
list2 = []

def inRange(min, max):
    x = 0
    # i = index, j = columns
    for i, j in itertools.product(range(rows), range(columns)):
        if df_test.iloc[i, j] >= min and df_test.iloc[i, j] <= max:
            x = x   1
        else:
            x = x   0
            list2.append((i, j))
        # print(x)
    if x == columns:
        list.insert(i, "True")
    else:
        list.insert(i, "False")

# calling function would need the values to continually change per min/max of each column
inRange(5, 10)

print(list) 
print(list2)  

Expected:

list = [False, True, True, False] list2 = [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)] or similar

calling function should continually update the min and max values in accordance with the columns in df_train_minMax, which is what I'm struggling to code.

CodePudding user response:

Replace values of each row using mask and setting as conditional numbers less or equal and greater or equal than x that is the value in a dataframe you iterate

>>> df_mask = df_test.mask(lambda x: (x>=5) & (x<=10))
>>> df_mask.values.tolist() 
[[1.0, 4.0, nan, 11.0], [nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, 11.0, 4.0]]

At this point pass through each element and if just contains True values means that you insert "True" because that entire row match for your condition between min and max numbers (5, 10)

>>> min_num, max_num = 5, 10
>>> df_mask = df_test.mask(lambda x: (x>=min_num) & (x<=max_num))
>>> df_mask
   col1  col2  col3  col4
0   1.0   4.0   NaN  11.0
1   NaN   NaN   NaN   NaN
2   NaN   NaN   NaN   NaN
3   NaN   NaN  11.0   4.0
>>> df_mask = df_mask.isnull()
>>> df_mask
    col1   col2   col3   col4
0  False  False   True  False
1   True   True   True   True
2   True   True   True   True
3   True   True  False  False
>>> df_mask = df_mask.values.tolist()
>>> df_mask
[[False, False, True, False], [True, True, True, True], [True, True, True, True], [True, True, False, False]]
# ...
>>> list2 = ["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df_mask]
>>> list2
['False', 'True', 'True', 'False']
  • Related