I have 2 dataframes as follows:
import pandas as pd
import itertools
df_test = {
"col1": [1, 5, 6, 5],
"col2": [4, 5, 6, 7],
"col3": [5, 9, 10, 11],
"col4": [11, 10, 10, 4],
}
df_train = {
"col1": [3, 6, 7, 4, 4],
"col2": [5, 10, 5, 5, 5],
"col3": [6, 8, 9, 10, 6],
"col4": [11, 10, 8, 12, 6],
}
df_train = pd.DataFrame(data=df_train)
df_test = pd.DataFrame(data=df_test)
I have already found the min and max values of df_train and want to see if df_test sits within these values, I want to iterate through each row, continually changing the min and max values as given for each column of the training set. I continually compare each value in the df_test row to see if it fits inside the min and max. If yes, then i will insert True to a list, else I'll write False.
# find minMax
df_train_minMax = df_train.agg([min, max])
# identify no of columns and rows in test data
columns = len(df_test.columns)
rows = len(df_test.index)
df_train_minMax looks like:
col1 col2 col3 col4
min 3 5 6 6
max 7 10 10 12
I have made 2 lists - list receives True or False values per row of df_test. list2 receives the index location of the False values. And then I've made a loop to iterate through df_test, although its becoming very complicated to read. I am certain there is a better and simpler way of doing this.
list = []
list2 = []
def inRange(min, max):
x = 0
# i = index, j = columns
for i, j in itertools.product(range(rows), range(columns)):
if df_test.iloc[i, j] >= min and df_test.iloc[i, j] <= max:
x = x 1
else:
x = x 0
list2.append((i, j))
# print(x)
if x == columns:
list.insert(i, "True")
else:
list.insert(i, "False")
# calling function would need the values to continually change per min/max of each column
inRange(5, 10)
print(list)
print(list2)
Expected:
list = [False, True, True, False] list2 = [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)] or similar
calling function should continually update the min and max values in accordance with the columns in df_train_minMax, which is what I'm struggling to code.
CodePudding user response:
Replace values of each row using mask
and setting as conditional numbers less or equal and greater or equal than x that is the value in a dataframe you iterate
>>> df_mask = df_test.mask(lambda x: (x>=5) & (x<=10))
>>> df_mask.values.tolist()
[[1.0, 4.0, nan, 11.0], [nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, 11.0, 4.0]]
At this point pass through each element and if just contains True
values means that you insert "True"
because that entire row match for your condition between min and max numbers (5, 10)
>>> min_num, max_num = 5, 10
>>> df_mask = df_test.mask(lambda x: (x>=min_num) & (x<=max_num))
>>> df_mask
col1 col2 col3 col4
0 1.0 4.0 NaN 11.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN 11.0 4.0
>>> df_mask = df_mask.isnull()
>>> df_mask
col1 col2 col3 col4
0 False False True False
1 True True True True
2 True True True True
3 True True False False
>>> df_mask = df_mask.values.tolist()
>>> df_mask
[[False, False, True, False], [True, True, True, True], [True, True, True, True], [True, True, False, False]]
# ...
>>> list2 = ["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df_mask]
>>> list2
['False', 'True', 'True', 'False']