I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.
EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.
original dataframe
transaction_id sub_transaction
abc1 {'id': 'abc1x', 'total': 10}
abc2 {'id': 'abc2x', 'total': 20}
abc3 {'id': 'abc3x', 'total': 30}
abc4 {}
abc5 {'id': 'abc5x'}
desired dataframe outcome
transaction_id sub_transaction_id total
abc1 abc1x 10
abc2 abc2x 20
abc3 abc3x 30
import pandas as pd
def parse_subtransaction(sub_transaction):
return pd.DataFrame({
'sub_transaction_id': [sub_transaction.get('id')],
'total': [sub_transaction.get('total')]})
def main():
df = pd.DataFrame({
'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
'sub_transaction': [
{'id': 'abc1x', 'total': 10},
{'id': 'abc2x', 'total': 20},
{'id': 'abc3x', 'total': 30},
{},
{'id':'abc5x'}]
})
applied_df = df.apply(
lambda row: parse_subtransaction(row['sub_transaction']),
axis='columns',
result_type='expand')
# ERROR: ValueError: If using all scalar values, you must pass an index
if (__name__ == "__main__"):
main()
CodePudding user response:
You could accomplish the same using:
df.join(pd.DataFrame(df.sub_transaction.tolist()))
transaction_id sub_transaction id total
0 abc1 {'id': 'abc1x', 'total': 10} abc1x 10.0
1 abc2 {'id': 'abc2x', 'total': 20} abc2x 20.0
2 abc3 {'id': 'abc3x', 'total': 30} abc3x 30.0
3 abc4 {} NaN NaN
4 abc5 {'id': 'abc5x'} abc5x NaN
CodePudding user response:
One option is with pandas string get
:
df.assign(sub_transaction_id = df.sub_transaction.str.get('id'),
total = df.sub_transaction.str.get('total'))
transaction_id sub_transaction sub_transaction_id total
0 abc1 {'id': 'abc1x', 'total': 10} abc1x 10.0
1 abc2 {'id': 'abc2x', 'total': 20} abc2x 20.0
2 abc3 {'id': 'abc3x', 'total': 30} abc3x 30.0
3 abc4 {} None NaN
4 abc5 {'id': 'abc5x'} abc5x NaN
The apply returns your function per row, which I suspect is not what you want, you probably want a single DataFrame containing your extracts.
CodePudding user response:
To use your own style:
import pandas as pd
def parse_subtransaction(sub_transaction):
return ({'sub_transaction_id': sub_transaction.get('id'), 'total': sub_transaction.get('total')})
def main():
df = pd.DataFrame({'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
'sub_transaction': [{'id': 'abc1x', 'total': 10}, {'id': 'abc2x', 'total': 20},
{'id': 'abc3x', 'total': 30},{},{'id':'abc5x'}]})
applied_df = df.apply(lambda row: parse_subtransaction(row['sub_transaction']), axis='columns', result_type='expand')
final_df = pd.concat([df.iloc[: , :-1], applied_df], axis=1)
print(final_df)
main()
CodePudding user response:
parse_subtransaction
should return a dict
or Series
, not a DataFrame
.*
def parse_subtransaction(sub_transaction):
return {
'sub_transaction_id': sub_transaction.get('id'),
'total': sub_transaction.get('total')}
Then to rejoin, we can use a variation of joris's solution:
pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)
* Although I'm not sure why exactly. I looked at the docs and type annotation, but couldn't find anything that specified the func parameter's return type precisely.