I've joined 3 Excel tab data sets to give me my base dataframe, and then for each line I want to count the int values in the comma separated values in DUAlloc and divide Amount by the DUAlloc Count then loop through the DuAlloc list and assign individual lines e.g.
Base Data:
Description | DuAlloc | Amount |
---|---|---|
Blah | 1,2,3,4,5 | 1000 |
Yada | 30,15,3,4,5 | 200 |
Processed Data:
Description | DuAlloc | Amount |
---|---|---|
Blah | 1 | 200 |
Blah | 2 | 200 |
Blah | 3 | 200 |
Yada | 3 | 40 |
Blah | 4 | 200 |
Yada | 4 | 40 |
Blah | 5 | 200 |
Yada | 5 | 40 |
Yada | 15 | 40 |
Yada | 30 | 40 |
I've tried numerous ways to convert to a list: list(), tolist(), but either get the same number for all the counts, or the nearest I've come is [len(str(c)) for c in df3['DUAlloc']]
which counts all the characters which I don't want.
How would I go about achieving this, and is Pandas the best route to take?
CodePudding user response:
Use Series.str.split
, df.explode
, Groupby.transform
and df.div
:
In [501]: out = df.assign(DuAlloc=df['DuAlloc'].str.split(',')).explode('DuAlloc')
In [506]: out['Amount'] = out['Amount'].div(out.groupby('Description')['Amount'].transform('size'))
In [507]: out
Out[507]:
Description DuAlloc Amount
0 Blah 1 200.0
0 Blah 2 200.0
0 Blah 3 200.0
0 Blah 4 200.0
0 Blah 5 200.0
1 Yada 30 40.0
1 Yada 15 40.0
1 Yada 3 40.0
1 Yada 4 40.0
1 Yada 5 40.0
CodePudding user response:
You can use .str.count
to count the number of ,
in columns.
out = (df.assign(Amount=df['Amount'].div(df['DuAlloc'].str.count(',').add(1)),
DuAlloc=df['DuAlloc'].str.split(','))
.explode('DuAlloc'))
print(out)
Description DuAlloc Amount
0 Blah 1 200.0
0 Blah 2 200.0
0 Blah 3 200.0
0 Blah 4 200.0
0 Blah 5 200.0
1 Yada 30 40.0
1 Yada 15 40.0
1 Yada 3 40.0
1 Yada 4 40.0
1 Yada 5 40.0