Iterating over two dataframes at the same time with an if statement

Time:11-28

I have two dataframes. I want to iterate through the first one and, wherever a condition holds, look at the second dataframe and check whether another condition holds at the same row and column position.

This would be dataframe 1:

id 1 2 3
0 XY00 AB80 XY01
1 FY34 XY60 XY91
2 AB46 AC40 NY23
3 XY70 AB23 DG60

This would be dataframe 2. It has the same id index and the same shape as dataframe 1, but different column names:

id e1 e2 e3
0 2003-12-09 2005-01-01 2006-12-14
1 2004-11-09 2002-01-01 1999-07-10
2 2012-02-13 2011-08-22 2003-03-16
3 2003-01-17 2005-01-01 2017-09-30

The ideal output would be:

id 1 2 3
0 XY00 AB80 XY01
3 XY70 AB23 DG60

So only the rows of dataframe 1 that contain a value starting with 'XY' whose date in the corresponding column of dataframe 2 is later than '2003-01-01'.

I tried this for loop, but it outputs an empty dataframe:

new_df = pd.DataFrame(data = None, columns = df1.columns)

for ind, row in df1.iterrows():
    if ((ind,row) == ("XY00")):
        for ind2, row2 in df2.iterrows():
             if((ind2,row2) >= ("2003-01-01")):
                    new_df = new_df.append(row)

CodePudding user response:

Here is a way that avoids iterating over the values.

Check for values that start with XY:

d1 = df1.stack().str.startswith('XY').unstack()

Check for dates that are after 2003-01-01. If the dataframe already contains only datetime dtypes, .apply(pd.to_datetime, errors='coerce') can be removed from the line below.

d2 = df2.apply(pd.to_datetime,errors='coerce').set_axis(df1.columns,axis=1).gt(pd.to_datetime('2003-01-01'))
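As a small illustration of what errors='coerce' does in the line above, here is a minimal sketch on a made-up Series (the name `raw` and the value "not a date" are only for illustration): anything that cannot be parsed becomes NaT, and NaT never compares as greater than the cutoff.

```python
import pandas as pd

# Hypothetical input: two valid ISO dates and one unparseable string.
raw = pd.Series(["2003-12-09", "2002-01-01", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Comparisons against NaT evaluate to False, so bad values drop out of the mask.
after_cutoff = parsed.gt(pd.Timestamp("2003-01-01"))
print(after_cutoff.tolist())
```

So a malformed date in df2 simply fails the date condition rather than stopping the script.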

Find the rows where at least one position is True in both dataframes:

m = (d1 & d2).any(axis=1)
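To see why the masks are combined cell by cell before calling any, here is a minimal sketch with two tiny hand-made boolean frames (`a` and `b` are hypothetical stand-ins for d1 and d2). In the second row each frame has a True, but not in the same cell, so that row should not match.

```python
import pandas as pd

# Hypothetical masks: row 1 has a True in both frames, but in different columns.
a = pd.DataFrame({"1": [True, True], "2": [False, False]})
b = pd.DataFrame({"1": [True, False], "2": [True, True]})

# Cell-wise AND first, then reduce across columns: only shared positions count.
same_cell = (a & b).any(axis=1)

# Reducing each frame separately would also accept row 1, which is not wanted here.
any_cell = a.any(axis=1) & b.any(axis=1)

print(same_cell.tolist())
print(any_cell.tolist())
```

This is also why the columns of d2 are relabelled with set_axis: the & operator aligns the two frames on their row and column labels.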

Use the mask m to filter the original df1:

filtered_df = df1.loc[m]

Output:

       1     2     3
id                  
0   XY00  AB80  XY01
3   XY70  AB23  DG60
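To try these steps end to end, here is a self-contained sketch that rebuilds the sample data from the question. It assumes that id is the index of both frames, that the column labels of df1 are strings, and that the dates arrive as strings. The first mask is built with a per-column apply instead of stack/unstack, which is a swapped-in variant that should give the same result.

```python
import pandas as pd

# Sample data from the question; `id` is assumed to be the index of both frames.
idx = pd.Index([0, 1, 2, 3], name="id")
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=idx,
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=idx,
)

# True where a cell of df1 starts with 'XY' (variant of the stack/unstack step).
d1 = df1.apply(lambda col: col.str.startswith("XY"))

# True where the date at the same position in df2 is after 2003-01-01;
# the columns are relabelled so that both masks align.
dates = df2.apply(pd.to_datetime, errors="coerce")
d2 = dates.gt(pd.Timestamp("2003-01-01")).set_axis(df1.columns, axis=1)

# Keep rows where both conditions hold in at least one shared cell.
m = (d1 & d2).any(axis=1)
filtered_df = df1.loc[m]
print(filtered_df)
```

Row 1 is dropped even though it contains XY60 and XY91, because the dates in those same columns (2002-01-01 and 1999-07-10) fall before the cutoff.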

CodePudding user response:

Can you try this:

#df1=df1.set_index('id')
#df2=df2.set_index('id')

final=df1.merge(df2,left_index=True, right_index=True) #merge df1 and df2 on index
print(final)
'''
      1     2     3         e1         e2         e3
0  XY00  AB80  XY01 2003-12-09 2005-01-01 2006-12-14
1  FY34  XY60  XY91 2004-11-09 2002-01-01 1999-07-10
2  AB46  AC40  NY23 2012-02-13 2011-08-22 2003-03-16
3  XY70  AB23  DG60 2003-01-17 2005-01-01 2017-09-30
'''

dictionary = dict(zip(df1.columns, df2.columns)) #map each df1 column (key) to its matching df2 column (value)
print(dictionary)
'''
{'1': 'e1', '2': 'e2', '3': 'e3'}
'''

Now I loop over the column pairs, apply both conditions to each pair, and collect the matching rows in a list. That list of dataframes is then combined into a single dataframe with concat.

df = pd.concat([final[(final[k].str.startswith('XY')) & (final[v] >= '2003-01-01')] for k,v in dictionary.items()]).drop_duplicates() #startswith, since the value must begin with 'XY'
df = df[df1.columns]
print(df)
'''
      1     2     3
0  XY00  AB80  XY01
3  XY70  AB23  DG60
'''
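If an explicit loop is preferred, closer to the attempt in the question, the following is a minimal sketch under a few assumptions: both frames have the same row order and shape, the dates are strings, and the names `cutoff` and `keep` are only illustrative. It collects the matching index labels and selects once at the end, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Sample data from the question; `id` is assumed to be the index of both frames.
idx = pd.Index([0, 1, 2, 3], name="id")
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=idx,
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=idx,
)

cutoff = pd.Timestamp("2003-01-01")
keep = []  # index labels of rows that satisfy both conditions in some column

# zip pairs the rows of both frames by position, so the same row order is assumed.
for (label, codes), (_, dates) in zip(df1.iterrows(), df2.iterrows()):
    # Inner zip pairs each code with the date in the matching column position.
    for code, date in zip(codes, dates):
        if code.startswith("XY") and pd.Timestamp(date) > cutoff:
            keep.append(label)
            break  # one matching cell is enough for this row

new_df = df1.loc[keep]
print(new_df)
```

The original loop compared the tuple (ind, row) with a string, which is never equal, so nothing was ever appended. Comparing the individual cell values, as here, is what makes the conditions fire.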
