Pandas .transform() results in NaN values after update to newer version

Time:10-15

I have some code that worked about 3-4 years ago. Since then I've upgraded to newer versions of pandas, NumPy, and Python, and it has broken. I've isolated what I believe is the issue, but I don't understand why it occurs.

def function_name(S):
    # Look up df2 values for this group's rows, scaled by the group size
    L = df2.reindex(S.index.droplevel(['column1','column2']))*len(S)
    return (-L/np.expm1(-L) - 1)

gb = df.groupby(level=['name1', 'name2'])

dc = gb.transform(function_name)

Problem: "dc", the result of the last line, is a pandas.Series containing only NaN values. It should contain no NaN values.

Relevant information: the data behind the gb object is correct and has no NaN or null values. When I print "L" or the return value inside the function, I see the correct values, but they are lost somewhere in the transform call that produces "dc". When I swap 'transform' for 'apply', 'dc' has the correct values, but the resulting object has duplicate column labels that make it unusable.

Thanks!

EDIT:

Below is a minimal example I put together to reproduce the problem.

import pandas as pd
import numpy as np

df1_arrays = [
    np.array(["CAT","CAT","CAT","CAT","CAT","CAT","CAT","CAT"]),
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]

df2_arrays = [
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]

df1 = pd.Series(np.abs(np.random.randn(8))*100, index=df1_arrays)
df2 = pd.Series(np.abs(np.random.randn(8)), index=df2_arrays)

df1.index.set_names(["mouse", "target", "barcode"], inplace=True)
df2.index.set_names(["target", "barcode"], inplace=True)

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)

    return (-lambdas/np.expm1(-lambdas) - 1)

gb = df1.groupby(level=['mouse','target'])

d_collisions = gb.transform(function_name)

print(d_collisions)

Output:

mouse  target  barcode
CAT    A       AAAT      NaN
               AAAG      NaN
               AAAC      NaN
               AAAD      NaN
       B       AAAZ      NaN
               AAAX      NaN
               AAAW      NaN
               AAAM      NaN

CodePudding user response:

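The NaNs appear because your function returns a Series whose index differs from the index of the group it was given. droplevel removes 'mouse', so the result is indexed by (target, barcode), while each group is indexed by (mouse, target, barcode). transform then aligns the returned Series to the group's index by label, finds no matching labels, and fills every position with NaN.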
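To make the mismatch concrete, here is a small sketch. The two-row index and the flat "a"/"b" Series are made up for illustration and are not part of the question's data. It shows that droplevel leaves a different set of index levels, and that reindexing to labels that do not exist yields NaN.

```python
import pandas as pd

# Hypothetical two-row index shaped like one group from the question
group_index = pd.MultiIndex.from_tuples(
    [("CAT", "A", "AAAT"), ("CAT", "A", "AAAG")],
    names=["mouse", "target", "barcode"],
)

# Dropping 'mouse' leaves a two-level index, not the group's three-level one
dropped = group_index.droplevel("mouse")
print(list(group_index.names))  # ['mouse', 'target', 'barcode']
print(list(dropped.names))      # ['target', 'barcode']

# Label-based alignment: labels missing from the source become NaN
flat = pd.Series([1.0, 2.0], index=["a", "b"])
realigned = flat.reindex(["x", "y"])
print(realigned.isna().all())   # True
```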

You can return a NumPy array from your function instead, so that the values are assigned by position rather than by label:

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)

    return (-lambdas/np.expm1(-lambdas) - 1).to_numpy()  # convert to array here

gb = df1.groupby(level=['mouse','target'])

d_collisions = gb.transform(function_name)

output:

mouse  target  barcode
CAT    A       AAAT        6.338965
               AAAG        2.815679
               AAAC        0.547306
               AAAD        1.811785
       B       AAAZ        1.881744
               AAAX       10.986611
               AAAW        5.124226
               AAAM        0.250513
dtype: float64
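If you would rather keep returning a Series, another option that should work is to reattach the group's own index before returning. The sketch below is a variant of the code above, not something from the original answer. It assumes the same data layout as the question and uses a seeded random generator (my addition) so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
targets = ["A"] * 4 + ["B"] * 4
barcodes = ["AAAT", "AAAG", "AAAC", "AAAD", "AAAZ", "AAAX", "AAAW", "AAAM"]

df1 = pd.Series(
    np.abs(rng.standard_normal(8)) * 100,
    index=pd.MultiIndex.from_arrays(
        [["CAT"] * 8, targets, barcodes], names=["mouse", "target", "barcode"]
    ),
)
df2 = pd.Series(
    np.abs(rng.standard_normal(8)),
    index=pd.MultiIndex.from_arrays([targets, barcodes], names=["target", "barcode"]),
)

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel("mouse")) * len(S)
    result = -lambdas / np.expm1(-lambdas) - 1
    # Variant: keep a Series, but give it the group's original index
    return pd.Series(result.to_numpy(), index=S.index)

d_collisions = df1.groupby(level=["mouse", "target"]).transform(function_name)
print(d_collisions)
```

Because the returned Series carries the same labels as the group, label alignment has nothing to drop, and the result keeps the (mouse, target, barcode) index of df1.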