I have some code that worked ~3-4 years ago. Since then I've upgraded to newer versions of pandas, NumPy, and Python, and it has broken. I've isolated what I believe is the issue, but I don't quite understand why it occurs.
def function_name(S):
    L = df2.reindex(S.index.droplevel(['column1','column2']))*len(S)
    return (-L/np.expm1(-L) - 1)

gb = df.groupby(level=['name1', 'name2'])
dc = gb.transform(function_name)
Problem: after the last line, "dc" is a pandas.Series containing only NaN values. It should contain no NaN values.
Relevant information: the gb object is correct and has no NaN or null values. When I print "L" inside the function, or the value being returned, I see the correct values, but they are lost somewhere in the "dc" line. When I swap 'transform' for 'apply', "dc" holds the correct values, but the resulting object has duplicate index labels that make it unusable.
Thanks!
EDIT:
Below is some minimal code I spun up to produce the error.
import pandas as pd
import numpy as np
df1_arrays = [
    np.array(["CAT","CAT","CAT","CAT","CAT","CAT","CAT","CAT"]),
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]
df2_arrays = [
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]
df1 = pd.Series(np.abs(np.random.randn(8))*100, index=df1_arrays)
df2 = pd.Series(np.abs(np.random.randn(8)), index=df2_arrays)
df1.index.set_names(["mouse", "target", "barcode"], inplace=True)
df2.index.set_names(["target", "barcode"], inplace=True)
def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1)
gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
print(d_collisions)
mouse  target  barcode
CAT    A       AAAT      NaN
               AAAG      NaN
               AAAC      NaN
               AAAD      NaN
       B       AAAZ      NaN
               AAAX      NaN
               AAAW      NaN
               AAAM      NaN
Answer:
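The NaNs appear because your function returns a Series whose index (target, barcode) differs from the index of the group it was given (mouse, target, barcode). transform aligns the returned Series to the group's index, finds no matching labels, and fills every position with NaN.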
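The mismatch can be checked by calling the function on a single group by hand and comparing index levels. This is a minimal sketch that assumes the same index layout as the question; the numeric values are made up for illustration and are not the question's random data.

```python
import numpy as np
import pandas as pd

# Toy data with the same index layout as the question (values are made up).
idx3 = pd.MultiIndex.from_arrays(
    [["CAT"] * 4, ["A", "A", "B", "B"], ["AAAT", "AAAG", "AAAZ", "AAAX"]],
    names=["mouse", "target", "barcode"],
)
idx2 = idx3.droplevel("mouse")
df1 = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx3)
df2 = pd.Series([0.1, 0.2, 0.3, 0.4], index=idx2)

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(["mouse"])) * len(S)
    return -lambdas / np.expm1(-lambdas) - 1

# Take the first group and call the function on it directly.
name, group = next(iter(df1.groupby(level=["mouse", "target"])))
result = function_name(group)

group_levels = list(group.index.names)    # levels of the group passed in
result_levels = list(result.index.names)  # levels of the Series handed back
print(group_levels, result_levels)
```

The values in the returned Series are fine; only its labels no longer match the group's, which is consistent with the correct numbers showing up when printed inside the function.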
To avoid the alignment, return a NumPy array from your function instead:
def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1).to_numpy() # convert to array here

gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
Output:
mouse  target  barcode
CAT    A       AAAT        6.338965
               AAAG        2.815679
               AAAC        0.547306
               AAAD        1.811785
       B       AAAZ        1.881744
               AAAX       10.986611
               AAAW        5.124226
               AAAM        0.250513
dtype: float64
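If you would rather hand back a labelled Series than a bare array, one option is to rebuild the result on the group's own index so the labels match. This is a sketch of an alternative, not something from the code above: `function_name_labelled` is a hypothetical name and the data are fixed toy values with the same index layout as the question.

```python
import numpy as np
import pandas as pd

# Toy data with the same index layout as the question (values are made up).
idx3 = pd.MultiIndex.from_arrays(
    [["CAT"] * 4, ["A", "A", "B", "B"], ["AAAT", "AAAG", "AAAZ", "AAAX"]],
    names=["mouse", "target", "barcode"],
)
df1 = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx3)
df2 = pd.Series([0.1, 0.2, 0.3, 0.4], index=idx3.droplevel("mouse"))

def function_name_labelled(S):  # hypothetical variant of function_name
    lambdas = df2.reindex(S.index.droplevel(["mouse"])) * len(S)
    values = -lambdas / np.expm1(-lambdas) - 1
    # Re-attach the group's three-level index so the labels line up.
    return pd.Series(values.to_numpy(), index=S.index)

gb = df1.groupby(level=["mouse", "target"])
d_collisions = gb.transform(function_name_labelled)
print(d_collisions)
```

Either approach gives the same numbers; returning the array is shorter, while rebuilding the Series keeps the function usable outside of transform.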