I have two dataframes. I want to iterate through the first one and, wherever a condition holds, look at the second dataframe and check whether another condition holds at the same row and column position.
This is dataframe 1:
id | 1 | 2 | 3 |
---|---|---|---|
0 | XY00 | AB80 | XY01 |
1 | FY34 | XY60 | XY91 |
2 | AB46 | AC40 | NY23 |
3 | XY70 | AB23 | DG60 |
This is dataframe 2. It has the same id and index as dataframe 1 and the same number of columns, but the column names differ:
id | e1 | e2 | e3 |
---|---|---|---|
0 | 2003-12-09 | 2005-01-01 | 2006-12-14 |
1 | 2004-11-09 | 2002-01-01 | 1999-07-10 |
2 | 2012-02-13 | 2011-08-22 | 2003-03-16 |
3 | 2003-01-17 | 2005-01-01 | 2017-09-30 |
The ideal output would be:
id | 1 | 2 | 3 |
---|---|---|---|
0 | XY00 | AB80 | XY01 |
3 | XY70 | AB23 | DG60 |
So I only want the rows of dataframe 1 that contain a value starting with 'XY' whose corresponding cell in dataframe 2 is on or after '2003-01-01'.
I tried this for loop, but it outputs an empty dataframe:
new_df = pd.DataFrame(data = None, columns = df1.columns)
for ind, row in df1.iterrows():
    if ((ind,row) == ("XY00")):
        for ind2, row2 in df2.iterrows():
            if((ind2,row2) >= ("2003-01-01")):
                new_df = new_df.append(row)
CodePudding user response:
Here is a way that avoids iterating over the values.
Check for values that start with XY:
d1 = df1.stack().str.startswith('XY').unstack()
Check for date values that are after 2003-01-01. If the dataframe already contains only datetime dtypes, .apply(pd.to_datetime,errors='coerce') can be removed from the line below.
d2 = df2.apply(pd.to_datetime,errors='coerce').set_axis(df1.columns,axis=1).gt(pd.to_datetime('2003-01-01'))
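As a small illustration of what errors='coerce' does in the step above: values that cannot be parsed become NaT, and NaT compares as False against the cutoff. The series below is made up for the example and is not part of the question's data.

```python
import pandas as pd

# Hypothetical column with one malformed entry, for illustration only.
dates = pd.Series(["2003-12-09", "not a date", "1999-07-10"])

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

# NaT is never greater than a timestamp, so the bad entry ends up False.
after_cutoff = parsed.gt(pd.Timestamp("2003-01-01"))

print(after_cutoff.tolist())
```

So a malformed date simply fails the condition rather than stopping the whole computation.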
Find the rows where at least one position is True in both dataframes:
m = (d1 & d2).any(axis=1)
Use the above to filter the original df1
filtered_df = df1.loc[m]
Output:
1 2 3
id
0 XY00 AB80 XY01
3 XY70 AB23 DG60
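For anyone who wants to run this end to end, here is a self-contained sketch using the sample data from the question. It assumes id is the index of both dataframes, and it swaps the stack/unstack step for a column-wise apply, which should give the same boolean frame.

```python
import pandas as pd

# Sample data copied from the question; "id" is assumed to be the index.
idx = pd.Index([0, 1, 2, 3], name="id")
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=idx,
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=idx,
)

# True where a df1 cell starts with "XY" (column-wise instead of stack/unstack).
d1 = df1.apply(lambda col: col.str.startswith("XY"))

# Parse the dates, relabel the columns to match df1, then compare to the cutoff.
d2 = df2.apply(pd.to_datetime, errors="coerce")
d2.columns = df1.columns
d2 = d2.gt(pd.Timestamp("2003-01-01"))

# Keep rows where both conditions hold in at least one position.
m = (d1 & d2).any(axis=1)
filtered_df = df1.loc[m]
print(filtered_df)
```

Note that gt is a strict comparison; if a date of exactly 2003-01-01 should also count, ge can be used instead.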
CodePudding user response:
Can you try this:
#df1=df1.set_index('id')
#df2=df2.set_index('id')
final=df1.merge(df2,left_index=True, right_index=True) #merge df1 and df2 on index
print(final)
'''
1 2 3 e1 e2 e3
0 XY00 AB80 XY01 2003-12-09 2005-01-01 2006-12-14
1 FY34 XY60 XY91 2004-11-09 2002-01-01 1999-07-10
2 AB46 AC40 NY23 2012-02-13 2011-08-22 2003-03-16
3 XY70 AB23 DG60 2003-01-17 2005-01-01 2017-09-30
'''
dictionary = dict(zip(df1.columns, df2.columns)) #map each df1 column to its matching df2 column as key-value pairs
print(dictionary)
'''
{'1': 'e1', '2': 'e2', '3': 'e3'}
'''
Now I loop through each pair of matching columns, apply the conditions, and collect the filtered dataframes in a list. Then I combine that list into a single dataframe with concat and drop the duplicate rows.
df = pd.concat([final[(final[k].str.startswith('XY')) & (final[v] >= '2003-01-01')] for k,v in dictionary.items()]).drop_duplicates()
df = df[df1.columns]
print(df)
'''
1 2 3
0 XY00 AB80 XY01
3 XY70 AB23 DG60
'''
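A possible variation on the same idea, sketched here with the question's sample data: instead of merging and concatenating, build one boolean mask by OR-ing the result of each column pair. This is not the code above, just an alternative that keeps the original row order and does not rely on drop_duplicates. It assumes the dates are ISO-formatted strings, which compare correctly as plain text.

```python
import pandas as pd

# Sample data copied from the question; "id" is assumed to be the index.
idx = pd.Index([0, 1, 2, 3], name="id")
df1 = pd.DataFrame(
    {"1": ["XY00", "FY34", "AB46", "XY70"],
     "2": ["AB80", "XY60", "AC40", "AB23"],
     "3": ["XY01", "XY91", "NY23", "DG60"]},
    index=idx,
)
df2 = pd.DataFrame(
    {"e1": ["2003-12-09", "2004-11-09", "2012-02-13", "2003-01-17"],
     "e2": ["2005-01-01", "2002-01-01", "2011-08-22", "2005-01-01"],
     "e3": ["2006-12-14", "1999-07-10", "2003-03-16", "2017-09-30"]},
    index=idx,
)

# Start with an all-False mask and OR in the result for each column pair.
keep = pd.Series(False, index=df1.index)
for k, v in zip(df1.columns, df2.columns):
    starts_xy = df1[k].str.startswith("XY")
    # ISO date strings (YYYY-MM-DD) sort chronologically as text.
    on_or_after = df2[v] >= "2003-01-01"
    keep |= starts_xy & on_or_after

result = df1[keep]
print(result)
```

Because the mask is built on df1's own index, rows that happen to be exact duplicates of each other would both be kept, which drop_duplicates would not do.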