Home > Software engineering >  Break out DataFrame column into multiple columns using apply
Break out DataFrame column into multiple columns using apply

Time:05-18

I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.

EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.

original dataframe

transaction_id               sub_transaction
          abc1  {'id': 'abc1x', 'total': 10}
          abc2  {'id': 'abc2x', 'total': 20}
          abc3  {'id': 'abc3x', 'total': 30}
          abc4                            {}
          abc5               {'id': 'abc5x'}

desired dataframe outcome

transaction_id  sub_transaction_id  total
abc1                         abc1x     10
abc2                         abc2x     20
abc3                         abc3x     30
import pandas as pd

def parse_subtransaction(sub_transaction):
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})

def main():
    df = pd.DataFrame({
        'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
        'sub_transaction': [
            {'id': 'abc1x', 'total': 10},
            {'id': 'abc2x', 'total': 20},
            {'id': 'abc3x', 'total': 30},
            {},
            {'id':'abc5x'}]
        })

    applied_df = df.apply(
        lambda row: parse_subtransaction(row['sub_transaction']),
        axis='columns',
        result_type='expand')

# ERROR: ValueError: If using all scalar values, you must pass an index

if (__name__ == "__main__"):
    main()

CodePudding user response:

You could accomplish the same using:

df.join(pd.DataFrame(df.sub_transaction.tolist()))
 
  transaction_id               sub_transaction     id  total
0           abc1  {'id': 'abc1x', 'total': 10}  abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}  abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}  abc3x   30.0
3           abc4                            {}    NaN    NaN
4           abc5               {'id': 'abc5x'}  abc5x    NaN

CodePudding user response:

One option is with pandas string get:

df.assign(sub_transaction_id = df.sub_transaction.str.get('id'), 
          total = df.sub_transaction.str.get('total'))

  transaction_id               sub_transaction sub_transaction_id  total
0           abc1  {'id': 'abc1x', 'total': 10}              abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}              abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}              abc3x   30.0
3           abc4                            {}               None    NaN
4           abc5               {'id': 'abc5x'}              abc5x    NaN

The apply returns your function per row, which I suspect is not what you want, you probably want a single DataFrame containing your extracts.

CodePudding user response:

To use your own style:

import pandas as pd

def parse_subtransaction(sub_transaction):
    return ({'sub_transaction_id': sub_transaction.get('id'), 'total': sub_transaction.get('total')})
   

def main():
    df = pd.DataFrame({'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
                       'sub_transaction': [{'id': 'abc1x', 'total': 10}, {'id': 'abc2x', 'total': 20},
                                           {'id': 'abc3x', 'total': 30},{},{'id':'abc5x'}]})

    applied_df = df.apply(lambda row: parse_subtransaction(row['sub_transaction']), axis='columns', result_type='expand')
    final_df = pd.concat([df.iloc[: , :-1], applied_df], axis=1)
    print(final_df)
    
main()

CodePudding user response:

parse_subtransaction should return a dict or Series, not a DataFrame.*

def parse_subtransaction(sub_transaction):
    return {
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total')}

Then to rejoin, we can use a variation of joris's solution:

pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)

* Although I'm not sure why exactly. I looked at the docs and type annotation, but couldn't find anything that specified the func parameter's return type precisely.

  • Related