Home > OS >  pandas how to drop duplicated rows based on conditions
pandas how to drop duplicated rows based on conditions

Time:11-25

I have a df ,you can have by this code:

import numpy as np
import pandas as pd
from io import StringIO
dfs = """
    contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
84  GA16      77           T  G   19940701    19480101.0  197702.0   190001.0
85  GA16      77           T  G   19940701    19980101.0  999999.0   197703.0

"""
df = pd.read_csv(StringIO(dfs.strip()), sep='\s ', 
                  dtype={"RB": int}
                  )
df

Out put:

contract    RB  RateCompany gs  IssueDate   ValIssueDate    ToDate1    FromDate1
84  GA16    77  T           G   19940701    19480101.0      197702.0    190001.0
85  GA16    77  T           G   19940701    19980101.0      999999.0    197703.0

For this df, a set of contract with RB ,is unique,it means only 1 row should be left,and the condition is:

FromDate1<=df.IssueDate<=ToDate1

So I tired:

df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
                 (df['IssueDate'] <= df['ToDate1']) &
                 (df['IssueDate'] >= df['FromDate1']))]

But the output is blank:

contract    RB  RateCompany gs  IssueDate   ValIssueDate    ToDate1 FromDate1

The expected output should be:

contract    RB  RateCompany gs  IssueDate   ValIssueDate    ToDate1    FromDate1

85  GA16    77  T           G   19940701    19980101.0      999999.0    197703.0

any friend can help?

CodePudding user response:

Divide your IssueDate by 100, probably a problem of unit from time conversion:

>>> df.loc[df['IssueDate'].div(100).between(df['FromDate1'], df['ToDate1'])]
   contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
85     GA16  77           T  G   19940701    19980101.0  999999.0   197703.0

Check it:

>>> df.loc[85, ['IssueDate', 'FromDate1', 'ToDate1']].astype(int)

IssueDate    19940701
FromDate1      197703
ToDate1        999999
Name: 85, dtype: int64

CodePudding user response:

Lets discuss your answer first

df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
                 (df['IssueDate'] <= df['ToDate1']) &
                 (df['IssueDate'] >= df['FromDate1']))]

Sure this code will give you blank result. It might be because of different data types of the columns.

First make sure those columns (that you want to compare have same datatype) by df.dtypes if the data you want to compare is date, compare with same format. Example if you want to compare "YYYYMM" you should compare it with date with format "YYYYMM"

  • Related