How to find similar data points based on multiple conditions on certain columns in Pandas?

I have a dataset with the columns 'DATE_TIME', 'ID', 'VALUE1', 'VALUE2', 'VALUE3', 'VALUE4', 'MODEL', 'SOLD', 'INSPECTION', 'MODE', 'TIME', 'CYCLE_PART'. The ID values are usually numbers, but there are alphabetical values too.

import numpy as np
import pandas as pd
import random

df = pd.DataFrame({'DATE_TIME': pd.date_range('2022-11-01', '2022-11-06 23:00:00', freq='20min'),
                   'ID': [random.randrange(1, 20) for n in range(430)]})

df['VALUE1'] = [random.randrange(110, 140) for n in range(430)]
df['VALUE2'] = [random.randrange(50, 60) for n in range(430)]
df['VALUE3'] = [random.randrange(80, 100) for n in range(430)]
df['VALUE4'] = [random.randrange(30, 50) for n in range(430)]

df['MODEL'] = [random.randrange(1, 3) for n in range(430)]

df['SOLD'] = [random.randrange(0, 2) for n in range(430)]

df['INSPECTION'] = df['DATE_TIME'].dt.day

df['MODE'] = np.select([df['INSPECTION'] == 1, df['INSPECTION'].isin([2, 3])], ['A', 'B'], 'C')

df['TIME'] = df['DATE_TIME'].dt.time
# df['TIME'] = pd.to_timedelta(df['TIME'])
df['TIME'] = df['TIME'].astype('str')


# Create the day/night CYCLE_PART column
def cycle_day_period(dataframe: pd.DataFrame, midnight='00:00:00', start_of_morning='06:00:00',
                     start_of_afternoon='13:00:00',
                     start_of_evening='18:00:00', end_of_evening='23:00:00', start_of_night='24:00:00'):
    bins = [midnight, start_of_morning, start_of_afternoon, start_of_evening, end_of_evening, start_of_night]
    labels = ['Night', 'Morning', 'Morning', 'Night', 'Night']

    return pd.cut(
        pd.to_timedelta(dataframe),
        bins=list(map(pd.Timedelta, bins)),
        labels=labels, right=False, ordered=False
    )


df['CYCLE_PART'] = cycle_day_period(df['TIME'], '00:00:00', '06:00:00', '13:00:00', '18:00:00', '23:00:00', '24:00:00')

My expectation is to find the most similar or identical values among 'VALUE1', 'VALUE2', 'VALUE3', 'VALUE4'. The matched rows should have the same MODEL but different SOLD values.

For example, I have the following data table:

id	VALUE1	VALUE2	VALUE3	VALUE4	MODE	SOLD
25	50	88	32	81	1	0
25	80	22	19	22	2	0
25	100	44	72	54	1	0
18	99	24	29	22	2	1
18	55	64	46	68	1	1
18	44	89	115	23	2	1

I would expect the 2nd and 4th rows as my output for mode 2, and the 1st and 5th rows as my output for mode 1. How can I achieve this output? I tried multiple boolean conditions, but ended up with errors.

CodePudding user response：

This should work, but it is slow because it computes the Euclidean distance between every SOLD == 1 row and every SOLD == 0 row within each MODEL group. Still, it demonstrates the basic idea. If you want something faster, you can look into the matrixprofile library. The result is stored in smallest as [[model_number1, smallest pair], [model_number2, smallest pair]].

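import numpy as np

value_cols = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']
# Group by a plain column name so group_name is a scalar, not a 1-tuple
model_groups = df.groupby('MODEL')

def euclid_distance(a, b):
    # Rows yielded by iterrows() have object dtype, so cast to float first
    return np.sqrt(np.sum((a.astype(float) - b.astype(float)) ** 2))


smallest = []
for group_name, df_model in model_groups:
    sold_1 = df_model.loc[df_model['SOLD'] == 1]
    sold_0 = df_model.loc[df_model['SOLD'] == 0]
    distances = []
    # Compare every sold row with every unsold row of the same model
    for _, row1 in sold_1.iterrows():
        for _, row2 in sold_0.iterrows():
            dist = euclid_distance(row1.loc[value_cols], row2.loc[value_cols])
            distances.append([row1, row2, dist])

    if distances:  # skip models that lack a sold or an unsold row
        # Keep only the pair with the smallest distance
        smallest.append([group_name, min(distances, key=lambda x: x[2])])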

So here smallest[0][1][0] and smallest[0][1][1] give you the two closest rows for the first MODEL group (MODEL 1 in the sample data, where MODEL is 1 or 2), and smallest[0][1][2] is their distance.
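
The nested iterrows loops above can also be expressed as one broadcasted NumPy computation. The following is a minimal sketch, not part of the original answer: pairwise_euclid and closest_pair are hypothetical helper names, and it assumes both inputs are 2-D numeric arrays with the same number of columns. The sample values are rows 2, 4 and 6 of the example table in the question.

```python
import numpy as np


def pairwise_euclid(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of Euclidean distances."""
    # a[:, None, :] has shape (n, 1, k); b[None, :, :] has shape (1, m, k),
    # so the difference broadcasts to shape (n, m, k).
    diff = a[:, None, :].astype(float) - b[None, :, :].astype(float)
    return np.sqrt((diff ** 2).sum(axis=-1))


def closest_pair(a: np.ndarray, b: np.ndarray):
    """Return (i, j, distance) for the closest rows a[i] and b[j]."""
    dist = pairwise_euclid(a, b)
    i, j = np.unravel_index(dist.argmin(), dist.shape)
    return int(i), int(j), float(dist[i, j])


# Illustrative values: the mode-2 rows of the question's example table
sold_1 = np.array([[99, 24, 29, 22], [44, 89, 115, 23]])  # rows 4 and 6
sold_0 = np.array([[80, 22, 19, 22]])                     # row 2
i, j, d = closest_pair(sold_1, sold_0)
print(i, j, round(d, 2))
```

Note that the intermediate array has shape (n, m, k), so for large groups this trades memory for speed.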

CodePudding user response：

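Try this (I have not tested it). It computes all pairwise distances per MODEL group at once with scipy.spatial.distance_matrix instead of looping over rows.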

import numpy as np
from scipy.spatial import distance_matrix
model_groups = df.groupby('MODEL')  # plain column name keeps group_name a scalar

smallest = []
for group_name, df_model in model_groups:
    sold_1 = df_model.loc[df_model['SOLD']==1]
    sold_0 = df_model.loc[df_model['SOLD']==0]
    values_1 = sold_1[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
    values_0 = sold_0[['VALUE1','VALUE2','VALUE3','VALUE4']].to_numpy()
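    # Rows of mtrx correspond to sold_1, columns to sold_0
    mtrx = distance_matrix(values_1, values_0)
    # Convert the flat argmin into a (row, column) position
    ij_min = np.unravel_index(mtrx.argmin(), mtrx.shape)
    smallest.append([group_name, sold_1.iloc[ij_min[0]], sold_0.iloc[ij_min[1]]])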
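
As a quick check of this approach, the sketch below runs it on the six-row example table from the question. It is not from the original answer: the example frame and the closest_pairs helper are made up for illustration, the table's MODE column is treated as the grouping column MODEL (an assumption based on the question's wording), and groups lacking either a sold or an unsold row are skipped.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

VALUE_COLS = ['VALUE1', 'VALUE2', 'VALUE3', 'VALUE4']

# Example rows from the question; the table's MODE column is assumed to be MODEL.
example = pd.DataFrame({
    'ID':     [25, 25, 25, 18, 18, 18],
    'VALUE1': [50, 80, 100, 99, 55, 44],
    'VALUE2': [88, 22, 44, 24, 64, 89],
    'VALUE3': [32, 19, 72, 29, 46, 115],
    'VALUE4': [81, 22, 54, 22, 68, 23],
    'MODEL':  [1, 2, 1, 2, 1, 2],
    'SOLD':   [0, 0, 0, 1, 1, 1],
})


def closest_pairs(frame: pd.DataFrame) -> dict:
    """Map each MODEL to the index labels of its closest (SOLD == 1, SOLD == 0) rows."""
    result = {}
    for model, group in frame.groupby('MODEL'):
        sold_1 = group[group['SOLD'] == 1]
        sold_0 = group[group['SOLD'] == 0]
        if sold_1.empty or sold_0.empty:
            continue  # no cross-SOLD pair exists for this model
        mtrx = distance_matrix(sold_1[VALUE_COLS].to_numpy(),
                               sold_0[VALUE_COLS].to_numpy())
        i, j = np.unravel_index(mtrx.argmin(), mtrx.shape)
        result[model] = (sold_1.index[i], sold_0.index[j])
    return result


pairs = closest_pairs(example)
print(pairs)
```

With zero-based index labels, this pairs rows 4 and 0 for MODEL 1 and rows 3 and 1 for MODEL 2, which are the 5th/1st and 4th/2nd rows the question expects.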
