Home > Back-end >  How do I add a scalar value to grouped items in a pandas df
How do I add a scalar value to grouped items in a pandas df

Time:09-22

I have df to which I want to add a custom scalar that is in the lookup dataframe. So for every row that has chr1 I want to add 0, for every row that has chr2 I want to add 248956422, etc.:

lookup = pd.DataFrame(
[
    ["chr1", 0.0],
    ["chr2", "248956422.0"],
    ["chr3", "491149951.0"]
], 
    columns=["chromosome", "position"])

df = pd.DataFrame([
                    ["chr1", 50001],
                    ["chr1", 150001],
                    ["chr1", 250001],
                    ["chr2", 50001],
                    ["chr2", 350001],
                    ["chr3", 10000],
                    ["chr3", 110000],
                ], columns=["chrom", "midpoint"])

The end output should be like this:

        pd.DataFrame([
            ["chr1", 50001],
            ["chr1", 150001],
            ["chr1", 250001],
            ["chr2", 249006423],
            ["chr2", 249306423],
            ["chr3", 491159951],
            ["chr3", 491259951],
        ], columns=["chrom", "midpoint"])

I can do it in an apply function and loop through every row, but that seems inefficient. Is there a way to vectorize it and do it eficiently?

CodePudding user response:

Use Series.map by Series and add to original column midpoint:

s = df.set_index('chromosome')['position']
df2['midpoint']  = df2['chrom'].map(s).astype(float).astype(int)
print (df2)
  chrom   midpoint
0  chr1      50001
1  chr1     150001
2  chr1     250001
3  chr2  249006423
4  chr2  249306423
5  chr3  491159951
6  chr3  491259951

If possible some values not match, e.g. chr4:

df2 = pd.DataFrame([
                    ["chr1", 50001],
                    ["chr1", 150001],
                    ["chr1", 250001],
                    ["chr2", 50001],
                    ["chr2", 350001],
                    ["chr3", 10000],
                    ["chr4", 110000],
                ], columns=["chrom", "midpoint"])



s = df.set_index('chromosome')['position']
df2['midpoint']  = df2['chrom'].map(s).fillna(0).astype(float).astype(int)
print (df2)
  chrom   midpoint
0  chr1      50001
1  chr1     150001
2  chr1     250001
3  chr2  249006423
4  chr2  249306423
5  chr3  491159951
6  chr4     110000

CodePudding user response:

Or in one-line:

df2.set_index('chrom').add(df.set_index('chromosome').reindex(df2['chrom']).to_numpy().astype(float)).reset_index()

  chrom     midpoint
0  chr1      50001.0
1  chr1     150001.0
2  chr1     250001.0
3  chr2  249006423.0
4  chr2  249306423.0
5  chr3  491159951.0
6  chr3  491259951.0

Just set the index of df to chrom and reindex it to df2['chrom'] and add the corresponding index values together.

If some possibly don't match, do:

df2.set_index('chrom').add(df.set_index('chromosome').reindex(df2['chrom']).fillna(0).to_numpy().astype(float)).reset_index()
  • Related