I have two csv files. csv1
looks like this:
Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28
while csv2
looks like this:
ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
Both are much larger than what I show here. I need some way to match each value in the Title
column of csv1
to the corresponding value in the number
column of csv2
. When a match is found, I need the value in the Title
column of csv1
to be replaced with the corresponding value in the ID
column of csv2
. Thus I would want my desired output to be:
Title,glide gscore,IFDScore
14492699,-9.01,-1020.18
14492699,-8.759,-1020.01
14492699,-7.301,-1019.28
I am looking for a way to do it through pandas, bash or python.
This answer is close but gives me an ambiguous truth value of a DataFrame.
I also tried
update
in pandas without luck.
I'm not pasting the exact code I've tried yet because it would be overwhelming to see faulty code in pandas, bash and python all at once.
CodePudding user response:
You could map
it; then use fillna
in case there were any "Titles" that did not have a matching "number":
csv1 = pd.read_csv('first_csv.csv')
csv2 = pd.read_csv('second_csv.csv')
csv1['Title'] = csv1['Title'].map(csv2.set_index('number')['ID']).fillna(csv1['Title']).astype(int)
Output:
Title glide gscore IFDScore
0 14492699 -9.010 -1020.18
1 14492699 -8.759 -1020.01
2 14492699 -7.301 -1019.28
CodePudding user response:
You can use pandas
module to load your dataframe, and then, using merge
function, you can achieve what you are seeking for:
import pandas as pd
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")
merged = df1.merge(df2, left_on="Title", right_on="number", how="right")
merged["Title"] = merged["ID"]
merged
Output
Title | glide gscore | IFDScore | ID | smiles | number | |
---|---|---|---|---|---|---|
0 | 28604361 | nan | nan | 28604361 | NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3 | 102 |
1 | 14492699 | -9.01 | -1020.18 | 14492699 | COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C | 235 |
2 | 14492699 | -8.759 | -1020.01 | 14492699 | COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C | 235 |
3 | 14492699 | -7.301 | -1019.28 | 14492699 | COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C | 235 |
4 | 16888863 | nan | nan | 16888863 | COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO | 108 |
Note that the Nan
values are due to unavailable values. If your dataframe covers these parts too, it won't result in Nan
.