Iterating over two dataframes at the same time with an if statement

I have two dataframes. I want to iterate through the first one and, where a condition holds, check whether a second condition holds at the same row and column position in the second dataframe.

This would be dataframe 1:

id	1	2	3
0	XY00	AB80	XY01
1	FY34	XY60	XY91
2	AB46	AC40	NY23
3	XY70	AB23	DG60

This would be dataframe 2. It has the same id index and the same shape as dataframe 1, but different column names:

id	e1	e2	e3
0	2003-12-09	2005-01-01	2006-12-14
1	2004-11-09	2002-01-01	1999-07-10
2	2012-02-13	2011-08-22	2003-03-16
3	2003-01-17	2005-01-01	2017-09-30

The ideal output would be:

id	1	2	3
0	XY00	AB80	XY01
3	XY70	AB23	DG60

So I want only the rows of dataframe 1 that contain a value starting with 'XY' whose corresponding cell in dataframe 2 holds a date of '2003-01-01' or later.

I tried this for loop, but it outputs an empty dataframe:

new_df = pd.DataFrame(data = None, columns = df1.columns)

for ind, row in df1.iterrows():
    if ((ind,row) == ("XY00")):
        for ind2, row2 in df2.iterrows():
             if((ind2,row2) >= ("2003-01-01")):
                    new_df = new_df.append(row)

CodePudding user response：

Here is a way that avoids iterating over values.

Check for values that start with XY:

d1 = df1.stack().str.startswith('XY').unstack()

Check for date values that are after 2003-01-01. If df2 already has datetime dtypes, the .apply(pd.to_datetime, errors='coerce') call can be removed from the line below.

d2 = df2.apply(pd.to_datetime,errors='coerce').set_axis(df1.columns,axis=1).gt(pd.to_datetime('2003-01-01'))
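As a small illustration of what errors='coerce' does in this step, here is a sketch on a made-up Series (the values in `raw` are hypothetical and not part of the question's data). Anything that cannot be parsed becomes NaT, and a comparison against NaT evaluates to False, so such cells can never satisfy the date condition.

```python
import pandas as pd

# Hypothetical input: one valid date, one unparseable string, one missing value.
raw = pd.Series(["2003-12-09", "not a date", None])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Comparisons involving NaT are False, so only the valid, later date passes.
is_after = parsed.gt(pd.Timestamp("2003-01-01"))
print(is_after.tolist())
```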

Find the rows where, for at least one column, both conditions are True at the same position:

m = (d1 & d2).any(axis=1)

Use that row mask to filter the original df1:

filtered_df = df1.loc[m]

Output:

       1     2     3
id                  
0   XY00  AB80  XY01
3   XY70  AB23  DG60
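For anyone who wants to run the steps above end to end, here is a self-contained sketch. It rebuilds the two sample dataframes from the question (assuming string column names, id as the index, and dates stored as strings) and uses `DataFrame.apply` with `str.startswith` in place of the stack/unstack round trip, which should give the same boolean mask. The variable names are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; "id" is assumed to be the index.
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)

# Boolean mask: which cells of df1 start with "XY"?
starts_xy = df1.apply(lambda col: col.str.startswith("XY"))

# Boolean mask: which cells of df2 hold a date after the cutoff?
dates = df2.apply(pd.to_datetime, errors="coerce")
after_cutoff = dates.gt(pd.Timestamp("2003-01-01"))

# Relabel the columns so both masks align position by position.
after_cutoff.columns = df1.columns

# Keep rows where at least one cell satisfies both conditions.
mask = (starts_xy & after_cutoff).any(axis=1)
filtered_df = df1.loc[mask]
print(filtered_df)
```

Relabelling the columns matters because `&` aligns dataframes by label, not by position. Without it, '1' and 'e1' would be treated as unrelated columns.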

CodePudding user response：

Can you try this:

#df1=df1.set_index('id')
#df2=df2.set_index('id')

final=df1.merge(df2,left_index=True, right_index=True) #merge df1 and df2 on index
print(final)
'''
      1     2     3         e1         e2         e3
0  XY00  AB80  XY01 2003-12-09 2005-01-01 2006-12-14
1  FY34  XY60  XY91 2004-11-09 2002-01-01 1999-07-10
2  AB46  AC40  NY23 2012-02-13 2011-08-22 2003-03-16
3  XY70  AB23  DG60 2003-01-17 2005-01-01 2017-09-30
'''

dictionary = dict(zip(df1.columns, df2.columns)) # map each df1 column to its matching df2 column
print(dictionary)
'''
{'1': 'e1', '2': 'e2', '3': 'e3'}
'''

Now I loop over each pair of matching columns, apply both conditions, and collect the filtered dataframes in a list. Then I combine that list into a single dataframe with concat.

df = pd.concat([final[(final[k].str.startswith('XY')) & (final[v] >= '2003-01-01')] for k,v in dictionary.items()]).drop_duplicates()
df = df[df1.columns]
print(df)
'''
      1     2     3
0  XY00  AB80  XY01
3  XY70  AB23  DG60
'''
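The merge-and-filter idea can also be run as one self-contained script. The sketch below makes a few assumptions: the dates are ISO-formatted strings (so plain string comparison orders them correctly), the index is the id, and the prefix test uses `str.startswith('XY')` because the question asks for values that start with 'XY'. The names `pairs`, `matches` and `result` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question, indexed by id.
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=[0, 1, 2, 3],
)

# Put code columns and date columns side by side, joined on the index.
final = df1.merge(df2, left_index=True, right_index=True)

# Pair each df1 column with the df2 column in the same position.
pairs = dict(zip(df1.columns, df2.columns))

# ISO date strings sort chronologically, so string comparison is enough here.
cutoff = "2003-01-01"
matches = [
    final[final[k].str.startswith("XY") & (final[v] >= cutoff)]
    for k, v in pairs.items()
]

# A row can match on several column pairs, so drop the repeats,
# restore index order, and keep only the original df1 columns.
result = pd.concat(matches).drop_duplicates().sort_index()[df1.columns]
print(result)
```

If the date columns are real datetimes rather than strings, comparing against `pd.Timestamp(cutoff)` is the safer choice.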