How to find similar data points based on multiple conditions on certain columns in Pandas?

Time:01-05

I have a dataset with the columns 'DATE_TIME', 'ID', 'VALUE1', 'VALUE2', 'VALUE3', 'VALUE4', 'MODEL', 'SOLD', 'INSPECTION', 'MODE', 'TIME', 'CYCLE_PART'. The ID column values are usually numbers, but there are alphabetical values too.

import numpy as np
import pandas as pd
import random

df = pd.DataFrame({'DATE_TIME': pd.date_range('2022-11-01', '2022-11-06 23:00:00', freq='20min'),
                   'ID': [random.randrange(1, 20) for n in range(430)]})

df['VALUE1'] = [random.randrange(110, 140) for n in range(430)]
df['VALUE2'] = [random.randrange(50, 60) for n in range(430)]
df['VALUE3'] = [random.randrange(80, 100) for n in range(430)]
df['VALUE4'] = [random.randrange(30, 50) for n in range(430)]

df['MODEL'] = [random.randrange(1, 3) for n in range(430)]

df['SOLD'] = [random.randrange(0, 2) for n in range(430)]

df['INSPECTION'] = df['DATE_TIME'].dt.day

df['MODE'] = np.select([df['INSPECTION'] == 1, df['INSPECTION'].isin([2, 3])], ['A', 'B'], 'C')

df['TIME'] = df['DATE_TIME'].dt.time
# df['TIME'] = pd.to_timedelta(df['TIME'])
df['TIME'] = df['TIME'].astype('str')


# Create DAY Night columns only-------------------------------------------------------------------------
def cycle_day_period(dataframe: pd.DataFrame, midnight='00:00:00', start_of_morning='06:00:00',
                     start_of_afternoon='13:00:00',
                     start_of_evening='18:00:00', end_of_evening='23:00:00', start_of_night='24:00:00'):
    bins = [midnight, start_of_morning, start_of_afternoon, start_of_evening, end_of_evening, start_of_night]
    labels = ['Night', 'Morning', 'Morning', 'Night', 'Night']

    return pd.cut(
        pd.to_timedelta(dataframe),
        bins=list(map(pd.Timedelta, bins)),
        labels=labels, right=False, ordered=False
    )


df['CYCLE_PART'] = cycle_day_period(df['TIME'], '00:00:00', '06:00:00', '13:00:00', '18:00:00', '23:00:00', '24:00:00')

My expectation is to find the rows whose 'VALUE1', 'VALUE2', 'VALUE3', 'VALUE4' values are the same or most similar. The MODEL column should be the same for the matched rows, whereas SOLD should be different.

For example, I have the following data table:

ID   VALUE1  VALUE2  VALUE3  VALUE4  MODEL  SOLD
25   50      88      32      81      1      0
25   80      22      19      22      2      0
25   100     44      72      54      1      0
18   99      24      29      22      2      1
18   55      64      46      68      1      1
18   44      89      115     23      2      1

I would expect the 2nd and 4th rows as my output for MODEL 2, and the 1st and 5th rows as my output for MODEL 1. How can I achieve this output? I tried multiple boolean conditions, but ended up with errors.

CodePudding user response:

This should work, but it is really slow because it computes the Euclidean distance between every SOLD == 1 row and every SOLD == 0 row of the same model in Python loops. It should still demonstrate the basic idea. If you want something faster, you can look into the matrixprofile library. The results are stored in smallest as [[model_number1, smallest pair], [model_number2, smallest pair]].

import numpy as np

value_cols = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']
model_groups = df.groupby('MODEL')


def euclid_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))


smallest = []
for group_name, df_model in model_groups:
    sold_1 = df_model.loc[df_model['SOLD'] == 1]
    sold_0 = df_model.loc[df_model['SOLD'] == 0]
    distances = []
    # Compare every sold row with every unsold row of the same model.
    for _, row1 in sold_1.iterrows():
        for _, row2 in sold_0.iterrows():
            # iterrows() yields object-dtype rows here, so cast the values to float.
            dist = euclid_distance(row1[value_cols].astype(float), row2[value_cols].astype(float))
            distances.append([row1, row2, dist])

    # Keep the pair with the smallest distance for this model.
    s = sorted(distances, key=lambda x: x[2])
    smallest.append([group_name, s[0]])

So here smallest[0][1][0] and smallest[0][1][1] give you the two closest rows for the first model group (MODEL 1 in the sample data, since randrange(1, 3) only produces 1 and 2).
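
If the nested loops are too slow, one possible alternative is to let pandas build the candidate pairs with a self-merge on MODEL and compute all distances in one vectorized step. This is a sketch, not the code from this answer: it swaps the loops for DataFrame.merge and runs on the six example rows from the question (the column holding 1 and 2 is treated as MODEL). The ROW column, the _sold/_unsold suffixes, and the DIST column are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Six example rows from the question (assumption: the 1/2 column is MODEL).
example = pd.DataFrame({
    'ID':     [25, 25, 25, 18, 18, 18],
    'VALUE1': [50, 80, 100, 99, 55, 44],
    'VALUE2': [88, 22, 44, 24, 64, 89],
    'VALUE3': [32, 19, 72, 29, 46, 115],
    'VALUE4': [81, 22, 54, 22, 68, 23],
    'MODEL':  [1, 2, 1, 2, 1, 2],
    'SOLD':   [0, 0, 0, 1, 1, 1],
})
value_cols = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']

# Keep the original row label (hypothetical ROW column) so matches can be looked up later.
rows = example.reset_index().rename(columns={'index': 'ROW'})
sold = rows[rows['SOLD'] == 1]
unsold = rows[rows['SOLD'] == 0]

# Merging on MODEL pairs every sold row with every unsold row of the same model.
pairs = sold.merge(unsold, on='MODEL', suffixes=('_sold', '_unsold'))

# Euclidean distance over the four value columns, computed for all pairs at once.
diff = (pairs[[c + '_sold' for c in value_cols]].to_numpy(dtype=float)
        - pairs[[c + '_unsold' for c in value_cols]].to_numpy(dtype=float))
pairs['DIST'] = np.sqrt((diff ** 2).sum(axis=1))

# One closest pair per model.
closest = pairs.loc[pairs.groupby('MODEL')['DIST'].idxmin(),
                    ['MODEL', 'ROW_sold', 'ROW_unsold', 'DIST']]
print(closest)
```

On the example rows this pairs row labels 4 and 0 for MODEL 1 and 3 and 1 for MODEL 2 (0-based), i.e. the 5th/1st and 4th/2nd rows from the question. Keep in mind that the merge materializes every sold/unsold combination per model, so memory grows with the product of the two group sizes.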

CodePudding user response:

Try this. Have not tested it.

import numpy as np
from scipy.spatial import distance_matrix
model_groups = df.groupby(by=['MODEL'])

smallest = []
for group_name, df_model in model_groups:
    sold_1 = df_model.loc[df_model['SOLD']==1]
    sold_0 = df_model.loc[df_model['SOLD']==0]
    values_1 = sold_1[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
    values_0 = sold_0[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
    mtrx = distance_matrix(values_1, values_0)
    ij_min = np.unravel_index(mtrx.argmin(), mtrx.shape)
    smallest.append([group_name,sold_1.iloc[ij_min[0]], sold_0.iloc[ij_min[1]]])
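
Since the snippet above is untested, here is a small self-contained check of the same idea on the six example rows from the question (the column holding 1 and 2 is treated as MODEL). It is only a sketch under a few assumptions: the function name closest_pair_per_model is made up, NumPy broadcasting stands in for scipy's distance_matrix so that only NumPy and pandas are needed, and models without both a SOLD == 1 and a SOLD == 0 row are skipped.

```python
import numpy as np
import pandas as pd

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']


def closest_pair_per_model(frame):
    """Return {model: (label of SOLD==1 row, label of SOLD==0 row, distance)}."""
    result = {}
    for model, group in frame.groupby('MODEL'):
        sold_1 = group[group['SOLD'] == 1]
        sold_0 = group[group['SOLD'] == 0]
        if sold_1.empty or sold_0.empty:
            continue  # no pair with differing SOLD exists for this model
        a = sold_1[VALUE_COLS].to_numpy(dtype=float)
        b = sold_0[VALUE_COLS].to_numpy(dtype=float)
        # Broadcasting: (n1, 1, 4) - (1, n0, 4) -> (n1, n0, 4); norm over the last axis.
        dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        # Position of the smallest entry in the (n1, n0) distance matrix.
        i, j = np.unravel_index(dist.argmin(), dist.shape)
        result[model] = (sold_1.index[i], sold_0.index[j], dist[i, j])
    return result


# Six example rows from the question (assumption: the 1/2 column is MODEL).
example = pd.DataFrame({
    'ID':     [25, 25, 25, 18, 18, 18],
    'VALUE1': [50, 80, 100, 99, 55, 44],
    'VALUE2': [88, 22, 44, 24, 64, 89],
    'VALUE3': [32, 19, 72, 29, 46, 115],
    'VALUE4': [81, 22, 54, 22, 68, 23],
    'MODEL':  [1, 2, 1, 2, 1, 2],
    'SOLD':   [0, 0, 0, 1, 1, 1],
})

print(closest_pair_per_model(example))
```

For MODEL 1 this returns row labels 4 and 0, and for MODEL 2 labels 3 and 1 (0-based), which are the 5th/1st and 4th/2nd rows the question expects. The four value columns are used unscaled here, so a column with a much larger range would dominate the distance.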