I have df to which I want to add a custom scalar that is in the lookup dataframe. So for every row that has chr1 I want to add 0, for every row that has chr2 I want to add 248956422, etc.:
lookup = pd.DataFrame(
[
["chr1", 0.0],
["chr2", "248956422.0"],
["chr3", "491149951.0"]
],
columns=["chromosome", "position"])
df = pd.DataFrame([
["chr1", 50001],
["chr1", 150001],
["chr1", 250001],
["chr2", 50001],
["chr2", 350001],
["chr3", 10000],
["chr3", 110000],
], columns=["chrom", "midpoint"])
The end output should be like this:
pd.DataFrame([
["chr1", 50001],
["chr1", 150001],
["chr1", 250001],
["chr2", 249006423],
["chr2", 249306423],
["chr3", 491159951],
["chr3", 491259951],
], columns=["chrom", "midpoint"])
I can do it in an apply function and loop through every row, but that seems inefficient. Is there a way to vectorize it and do it eficiently?
CodePudding user response:
Use Series.map
by Series
and add to original column midpoint
:
s = df.set_index('chromosome')['position']
df2['midpoint'] = df2['chrom'].map(s).astype(float).astype(int)
print (df2)
chrom midpoint
0 chr1 50001
1 chr1 150001
2 chr1 250001
3 chr2 249006423
4 chr2 249306423
5 chr3 491159951
6 chr3 491259951
If possible some values not match, e.g. chr4
:
df2 = pd.DataFrame([
["chr1", 50001],
["chr1", 150001],
["chr1", 250001],
["chr2", 50001],
["chr2", 350001],
["chr3", 10000],
["chr4", 110000],
], columns=["chrom", "midpoint"])
s = df.set_index('chromosome')['position']
df2['midpoint'] = df2['chrom'].map(s).fillna(0).astype(float).astype(int)
print (df2)
chrom midpoint
0 chr1 50001
1 chr1 150001
2 chr1 250001
3 chr2 249006423
4 chr2 249306423
5 chr3 491159951
6 chr4 110000
CodePudding user response:
Or in one-line:
df2.set_index('chrom').add(df.set_index('chromosome').reindex(df2['chrom']).to_numpy().astype(float)).reset_index()
chrom midpoint
0 chr1 50001.0
1 chr1 150001.0
2 chr1 250001.0
3 chr2 249006423.0
4 chr2 249306423.0
5 chr3 491159951.0
6 chr3 491259951.0
Just set the index of df
to chrom
and reindex it to df2['chrom']
and add the corresponding index values together.
If some possibly don't match, do:
df2.set_index('chrom').add(df.set_index('chromosome').reindex(df2['chrom']).fillna(0).to_numpy().astype(float)).reset_index()