Applying a function to a list of dataframes in Python

Time:11-23

Beginner Python question here that I've struggled to get answered from related Stack Overflow questions.

I've got a list

dfList = df0,df1,df2,...,df7

I've defined a function that takes a dataframe as its argument. I'm not sure the function itself matters, but to be safe it is basically

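# proportion_confint is presumably the statsmodels function (assumed import)
from statsmodels.stats.proportion import proportion_confint

def rateCalc(outcomeDataFrame):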
    rateList = list()
    upperRateList = list()
    lowerRateList = list()
    for i in range(len(outcomeDataFrame)):
        lowlevel, highlevel = proportion_confint(count=outcomeDataFrame.iloc[i,4], nobs=outcomeDataFrame.iloc[i,3])
        lowerRateList.append(lowlevel)
        rateList.append(outcomeDataFrame.iloc[i,4]/outcomeDataFrame.iloc[i,3])
        upperRateList.append(highlevel)

    outcomeDataFrame = outcomeDataFrame.assign(lowerRate=lowerRateList)
    outcomeDataFrame = outcomeDataFrame.assign(midrate=rateList)
    outcomeDataFrame = outcomeDataFrame.assign(upperRate=upperRateList)

    return outcomeDataFrame

What I'm trying to do is append the observed success ratio of two numbers as well as its 95% confidence interval. It goes fine when working with any individual df.

What I want to accomplish is to turn each item of dfList into a version of itself with those lowerRate, midrate, and upperRate values appended as new columns.

When I try to apply across each dataframe with

for i in range(len(dfList)):
   rateCalc(dfList[i])

though, it seems to execute only for df0. I can't make any sense of that; if I got a full error I'd assume I had some basic flaw in the code, but it seems to work for df0 and then not iterate to df1 and beyond.

I also thought there might be an issue of "df1 != dfList[1]" in some backend sense (that running the function on the list item dfList[1] would not have any effect on the original item df1) but, again, the fact that it seems to work with df0 would imply that's not the issue.

I also tried throwing some mud at the wall with the "map" function but am not sure I understand how to use that in this context (or any other for that matter ha)

Thanks all

CodePudding user response:

I think it is because assign returns a new DataFrame, and that new object only exists inside the function scope unless you keep the return value. Here is an example:

import pandas as pd
df_0 = pd.DataFrame(data = [{'column':'a'}])
df_1 = pd.DataFrame(data = [{'column':'c'}])
df_2 = pd.DataFrame(data = [{'column':'d'}])
df_altos = df_0,df_1,df_2

def mod_df(df):
    test = list()
    test.append('d')
    #print('id before setting another column ' + str(id(df)))
    #df['b'] = test
    print('id before assinging ' + str(id(df)))
    df = df.assign(lowerRate = test)
    print('id after  assinging ' + str(id(df)))
    return df

for i in range(len(df_altos)):
    mod_df(df_altos[i])

The ids printed for each dataframe are the following:

id before assinging 1833832455136
id after  assinging 1833832523568
id before assinging 1833832456144
id after  assinging 1833832525776
id before assinging 1833832454416
id after  assinging 1833832521888

As you can see, the id changes. You could try another way of assigning the column, such as the following:

def mod_df(df):
    test = list()
    test.append('d')
    print('id before setting another column ' + str(id(df)))
    df['b'] = test
    print('id after assinging ' + str(id(df)))
    return df

which outputs

id before setting another column 1833831955520
id after assinging 1833831955520
id before setting another column 1833791973888
id after assinging 1833791973888
id before setting another column 1833791973264
id after assinging 1833791973264

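Now the ids are the same and the new column exists on all the dataframes, because df['b'] = test modifies the original object in place instead of creating a new one. I don't know how it was working for the first dataframe in your code.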
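
If you would rather keep the assign version, another option is to keep what the function returns and rebuild the list from it. Here is a minimal sketch of that idea. Note that add_rate and the trials/successes columns are made up for illustration; they stand in for rateCalc and the real columns, and the confidence interval part is left out.

```python
import pandas as pd

def add_rate(df):
    # Hypothetical stand-in for rateCalc: like assign(), it returns a NEW frame
    # and leaves the frame that was passed in unchanged.
    return df.assign(midrate=df["successes"] / df["trials"])

# Made-up example data (column names are assumptions, not from the question).
df0 = pd.DataFrame({"trials": [10, 20], "successes": [5, 5]})
df1 = pd.DataFrame({"trials": [4, 8], "successes": [1, 2]})
df_list = [df0, df1]

# Calling the function and discarding the result changes nothing.
for df in df_list:
    add_rate(df)
unchanged = "midrate" not in df_list[0].columns

# Keeping the returned frames: rebuild the list from the results.
df_list = [add_rate(df) for df in df_list]

# The same thing with map, which applies add_rate to each item in turn.
df_list_via_map = list(map(add_rate, [df0, df1]))
```

After this, the items of df_list have the new column, but the names df0 and df1 still point at the original frames without it. If you need those names updated too, reassign them from the list (df0, df1 = df_list).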
