I have two separate data frames that I want to compare:
f1
P53-Malat1
Neat1-Malat1
Gap1-Malat1
and f2:
intA,intB
P53-Malat1,Neat1-Malat1
Gap1-Malat1,Malat1-Pias3
I want to iterate over the rows of each column in f2 and check whether each value appears in f1. If yes, then print that row as "found", and if not, as "not_found", in a separate column.
The same for column two in f2.
I tried this approach but it doesn't work - am I missing something?
with open("f1.txt","r") as f1:
content = f1.read().splitlines()
#print(content)
f2 = pd.read_csv("f2.csv")
f2["col1_search"] = f2.apply(lambda x: x["intA"] + "_found" if x in content else x["intA"] + "_not_found", axis=1)
f2["col2_search"] = f2.apply(lambda x: x["intB"] + "_found" if x in content else x["intB"] + "_not_found", axis=1)
so the desired output should be f2 in this format:
col1_search,col2_search
P53-Malat1_found,Neat1-Malat1_found
Gap1-Malat1_found,Malat1-Pias3_not_found
Thank you.
CodePudding user response:
If I understand correctly, content is a list, not a dataframe. If that's the case you can use .isin, which returns True or False for each row, and that result can be mapped to whatever suffix you want.
import pandas as pd
content = ['P53-Malat1','Neat1-Malat1','Gap1-Malat1']
f2 = pd.DataFrame({'intA': {0: 'P53-Malat1', 1: 'Gap1-Malat1'},
'intB': {0: 'Neat1-Malat1', 1: 'Malat1-Pias3'}})
f2['col1_search'] = f2.intA + f2.intA.isin(content).map({True: '_found', False: '_not_found'})
f2['col2_search'] = f2.intB + f2.intB.isin(content).map({True: '_found', False: '_not_found'})
Output
intA intB col1_search col2_search
0 P53-Malat1 Neat1-Malat1 P53-Malat1_found Neat1-Malat1_found
1 Gap1-Malat1 Malat1-Pias3 Gap1-Malat1_found Malat1-Pias3_not_found
Or perhaps if you have many columns:
(f2 + f2.isin(content).replace({True: '_found', False: '_not_found'})).add_suffix('_search')
Output
intA_search intB_search
0 P53-Malat1_found Neat1-Malat1_found
1 Gap1-Malat1_found Malat1-Pias3_not_found
which could be merged back to the original data with
pd.concat([f2, (f2 + f2.isin(content).replace({True: '_found', False: '_not_found'})).add_suffix('_search')], axis=1)
Output
intA intB intA_search intB_search
0 P53-Malat1 Neat1-Malat1 P53-Malat1_found Neat1-Malat1_found
1 Gap1-Malat1 Malat1-Pias3 Gap1-Malat1_found Malat1-Pias3_not_found
CodePudding user response:
This is an example of how you would use np.where:
import numpy as np
import pandas as pd

data = {'Category' : ['First', 'Second', 'Third'],
'First_Numbers' : [10, 10, 10],
'Second_Numbers' : [20, 20, 20],
'Third_Numbers' : [9, 21, 15]
}
df = pd.DataFrame(data)
comp_column = np.where((df['Third_Numbers'] < df['Second_Numbers']) & (df['Third_Numbers'] > df['First_Numbers']), 'found', 'not found')
df['check'] = comp_column
df
I inserted some sample data which you should be able to replace with your own data. Now I see you want to compare between 2 different dfs, so I would recommend merging them so that you are only working on a single df. This is the best documentation for merging/joining/concatenating pandas dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
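As a sketch of that merge-based route, using stand-in frames mirroring the question's data (the frame names and column names here are assumptions): merge with indicator=True adds a _merge column saying whether each row of f2 matched a row of f1, and np.where turns that marker into the found/not_found label.

```python
import numpy as np
import pandas as pd

# f1 as a one-column frame of known pairs; f2 as in the question
f1 = pd.DataFrame({'pair': ['P53-Malat1', 'Neat1-Malat1', 'Gap1-Malat1']})
f2 = pd.DataFrame({'intA': ['P53-Malat1', 'Gap1-Malat1'],
                   'intB': ['Neat1-Malat1', 'Malat1-Pias3']})

# Left-merge f2's intB column against f1; indicator=True adds a _merge
# column whose value is 'both' when the row matched, 'left_only' when not.
merged = f2.merge(f1, left_on='intB', right_on='pair',
                  how='left', indicator=True)
f2['col2_search'] = merged['intB'] + np.where(
    merged['_merge'] == 'both', '_found', '_not_found')
print(f2['col2_search'].tolist())
```

The same merge can be repeated with left_on='intA' for the first column.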
CodePudding user response:
f2 = pd.read_csv("f2.csv")

def transform(path: str, x):
    with open(path, "r") as f1:
        content = f1.read().splitlines()
    if x in content:
        return f"{x}_found"
    return f"{x}_not_found"

f2["col1_search"] = f2["intA"].apply(lambda x: transform("f1.txt", x))
f2["col2_search"] = f2["intB"].apply(lambda x: transform("f1.txt", x))
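Note that an approach like the one above reopens and re-reads the lookup file on every row. A variant that loads the contents once is usually preferable; here the file contents and f2 are inlined as stand-ins for f1.txt and f2.csv from the question.

```python
import pandas as pd

# Inline stand-ins for the contents of f1.txt and f2.csv
content = {'P53-Malat1', 'Neat1-Malat1', 'Gap1-Malat1'}  # set: O(1) lookups
f2 = pd.DataFrame({'intA': ['P53-Malat1', 'Gap1-Malat1'],
                   'intB': ['Neat1-Malat1', 'Malat1-Pias3']})

def label(x: str) -> str:
    # Tag each value with _found / _not_found based on set membership
    return f"{x}_found" if x in content else f"{x}_not_found"

f2['col1_search'] = f2['intA'].apply(label)
f2['col2_search'] = f2['intB'].apply(label)
print(f2[['col1_search', 'col2_search']].to_string(index=False))
```

Using a set instead of a list also makes each membership test constant-time, which matters once f1 grows large.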