Cleaning Pandas df with varying column types and values

Good evening,

My pandas df (python) looks like this:

I would like to do the following:

Create a date column using col 0 and col 1 -> 11 Apr
Join the strings that are between the date and the first numeric value and label them as Description 1.
Extract the first numeric value and label it as Amount 1.
Extract the second numeric value and label it as Amount 2.
Join the strings that come after the numeric values and label them as Description 2.

In the end, my pandas df would have:

Date: 11 Apr
Description 1: abcd efgh ijklmnop
Amount 1: 425.85 (12.34 is just a placeholder amount in the illustration)
Amount 2: 365.12 (12.34 is just a placeholder amount in the illustration)
Description 2: ab cdefgh ijklm

How do I effectively clean this df to achieve my desired outcome?

Thank you!

Sample Data:

{0: {20: '11', 21: '11', 22: '14', 23: '16', 24: '18', 25: '19', 26: '19'}, 1: {20: 'Apr', 21: 'Apr', 22: 'Apr', 23: 'Apr', 24: 'Apr', 25: 'Apr', 26: 'Apr'}, 2: {20: 'ACTNOWQUICK', 21: 'Cash', 22: 'ACTNOWQUICK', 23: 'ACTNOWQUICK', 24: 'Inward', 25: 'Cash', 26: 'Inward'}, 3: {20: '1234.56', 21: 'WithdrawalATM', 22: '76.53', 23: '1236.00', 24: 'DR', 25: 'WithdrawalATM', 26: 'CR'}, 4: {20: '1234.98', 21: '50.00', 22: '653.24', 23: '1234.78', 24: 'FUTHN', 25: '70.00', 26: 'YJHK'}, 5: {20: 'HYE912630964589376', 21: '1111.22', 22: 'HYE91234234589376', 23: 'HYE91263234234234376', 24: '60.00', 25: '222.22', 26: '33333.33'}, 6: {20: 'PLUTO', 21: '23523455', 22: 'WiN', 23: 'YOU', 24: '11.11', 25: '123123123', 26: '18.18'}, 7: {20: 'THEATRE', 21: None, 22: 'OTHR', 23: 'TECHY', 24: 'WOL', 25: None, 26: 'OTHER'}, 8: {20: 'OTHER', 21: None, 22: 'JOHNKLING', 23: 'BRO', 24: 'E54E236A58', 25: None, 26: 'Other'}, 9: {20: 'WUN', 21: None, 22: None, 23: 'OTHER', 24: 'FFF', 25: None, 26: 'PFFS'}, 10: {20: 'Cool', 21: None, 22: None, 23: '123123123523452', 24: 'UEJH', 25: None, 26: '(JUPITER)'}, 11: {20: 'Beans', 21: None, 22: None, 23: None, 24: None, 25: None, 26: 'EVEREST'}, 12: {20: 'KIng', 21: None, 22: None, 23: None, 24: None, 25: None, 26: '236272345235'}, 13: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 14: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 15: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 16: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}}

CodePudding user response:

We can join the non-null values of each row into a single string, and then apply a regex with Series.str.extract (with expand=True) to pull out the required groups as separate columns.

dict_ = {0: {20: '11', 21: '11', 22: '14', 23: '16', 24: '18', 25: '19', 26: '19'}, 1: {20: 'Apr', 21: 'Apr', 22: 'Apr', 23: 'Apr', 24: 'Apr', 25: 'Apr', 26: 'Apr'}, 2: {20: 'ACTNOWQUICK', 21: 'Cash', 22: 'ACTNOWQUICK', 23: 'ACTNOWQUICK', 24: 'Inward', 25: 'Cash', 26: 'Inward'}, 3: {20: '1234.56', 21: 'WithdrawalATM', 22: '76.53', 23: '1236.00', 24: 'DR', 25: 'WithdrawalATM', 26: 'CR'}, 4: {20: '1234.98', 21: '50.00', 22: '653.24', 23: '1234.78', 24: 'FUTHN', 25: '70.00', 26: 'YJHK'}, 5: {20: 'HYE912630964589376', 21: '1111.22', 22: 'HYE91234234589376', 23: 'HYE91263234234234376', 24: '60.00', 25: '222.22', 26: '33333.33'}, 6: {20: 'PLUTO', 21: '23523455', 22: 'WiN', 23: 'YOU', 24: '11.11', 25: '123123123', 26: '18.18'}, 7: {20: 'THEATRE', 21: None, 22: 'OTHR', 23: 'TECHY', 24: 'WOL', 25: None, 26: 'OTHER'}, 8: {20: 'OTHER', 21: None, 22: 'JOHNKLING', 23: 'BRO', 24: 'E54E236A58', 25: None, 26: 'Other'}, 9: {20: 'WUN', 21: None, 22: None, 23: 'OTHER', 24: 'FFF', 25: None, 26: 'PFFS'}, 10: {20: 'Cool', 21: None, 22: None, 23: '123123123523452', 24: 'UEJH', 25: None, 26: '(JUPITER)'}, 11: {20: 'Beans', 21: None, 22: None, 23: None, 24: None, 25: None, 26: 'EVEREST'}, 12: {20: 'KIng', 21: None, 22: None, 23: None, 24: None, 25: None, 26: '236272345235'}, 13: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 14: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 15: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}, 16: {20: None, 21: None, 22: None, 23: None, 24: None, 25: None, 26: None}}
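import pandas as pd

df = pd.DataFrame(dict_)
# Join the non-null cells of every column except 0 and 1 into one string per row.
joined = df[df.columns[~df.columns.isin([0, 1])]].apply(lambda x: ' '.join(x.dropna()), axis=1)
# Split each joined string into four parts with a regex (one capture group per new column).
df[['Description1', 'Amount1', 'Amount2', 'Description2']] = joined.str.extract(r'([a-zA-Z0-9]*) ([0-9]*[,.][0-9]*).* ([0-9]*[,.][0-9]*)(.*)', expand=True)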

Output

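This gives the expected output when a single token precedes the first amount. For rows such as 21 and 24, where several tokens come before it, Description1 keeps only the last of those tokens.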
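
If the Date column and multi-word descriptions are also needed, the same join-then-extract idea can be sketched as follows. This is only a sketch under a few assumptions that are not stated in the question: an amount is taken to be digits, a dot or comma, then exactly two digits; the two amounts are assumed to sit next to each other; and the amounts are assumed to use a dot as the decimal separator when converted to numbers. The `rows` list is a shortened, hypothetical subset of the sample data, and `out` and `pattern` are illustrative names.

```python
import pandas as pd

# Hypothetical subset of the sample data, shortened for readability.
rows = [
    ['11', 'Apr', 'ACTNOWQUICK', '1234.56', '1234.98', 'HYE912630964589376', 'PLUTO', None],
    ['11', 'Apr', 'Cash', 'WithdrawalATM', '50.00', '1111.22', '23523455', None],
    ['18', 'Apr', 'Inward', 'DR', 'FUTHN', '60.00', '11.11', 'WOL'],
]
df = pd.DataFrame(rows)

# Date: day (col 0) and month (col 1) joined with a space.
out = pd.DataFrame({'Date': df[0] + ' ' + df[1]})

# One string per row from the remaining non-null cells.
joined = df.drop(columns=[0, 1]).apply(lambda r: ' '.join(r.dropna()), axis=1)

# Assumption: amounts look like 1234.56 and the two amounts are adjacent.
# The lazy first group grows until two such amounts follow it, so it keeps
# every token before the first amount.
pattern = (
    r'^(?P<Description1>.*?)\s+'
    r'(?P<Amount1>\d+[.,]\d{2})\s+'
    r'(?P<Amount2>\d+[.,]\d{2})'
    r'(?:\s+(?P<Description2>.*))?$'
)
# Named groups become the column names of the extracted frame.
out = out.join(joined.str.extract(pattern))

# Assumption: dot decimal separator, as in the sample data.
out = out.assign(
    Amount1=pd.to_numeric(out['Amount1']),
    Amount2=pd.to_numeric(out['Amount2']),
)
print(out)
```

Rows that do not contain two adjacent amounts in this form would come back as NaN in all four extracted columns, which makes them easy to find and handle separately.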