Home > other >  How to compare each date in a cell with all the dates in a column
How to compare each date in a cell with all the dates in a column

Time:07-29

I have a dataframe with three columns lets say

Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06
Zara   qrs    2021-02-25

I want to compare each date in Date column with all the other dates in the Date column and only keep those rows which lie within 6 months of atleast one of all the dates.

for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
but, 
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit 
so we will drop that row.

Desired Output:

Name  Address Date
faraz  xyz    2022-01-01
Abdul  abc    2022-06-06

I have tried a couple of approches such a nested loops, but I got 1 million entries and it takes forever to run that loop. Some of the dates repeat too. Not all are unique.

for index, row in dupes_df.iterrows():

    for date in uniq_dates_list:

        format_date = datetime.strptime(date,'%d/%m/%y')

        if (( format_date.year - row['JournalDate'].year ) * 12   ( format_date.month - row['JournalDate'].month ) <= 6):

            print("here here")
            break
        else:
            dupes_df.drop(index, inplace=True)

I need a much more omptimal solution for it. Studied about lamba functions, but couldn't get to the depths of it.

CodePudding user response:

IIUC, this should work for you:

import pandas as pd
import itertools
from io import StringIO

data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])

df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]

Output:

    Name Address       Date
0  faraz     xyz 2022-01-01
1  Abdul     abc 2022-06-06

CodePudding user response:

First I think it's be easier if you use 'relativedelta' from 'dateutil'. Reff: https://pynative.com/python-difference-between-two-dates-in-months/

second, I think you need add coloumn. lets call it 'score'

at the second loop, if delta <= 6 month : set score = 1 and 'continue' This way each row is compared to all rows.

delete all row that score == 0.

  • Related