I have the following df:
Name | Time Elapsed | Amount |
---|---|---|
A | 2 | $2 |
A | 5 | $1 |
A | 7 | $6 |
B | 3 | $3 |
B | 5 | $5 |
I would like to expand this per group so that I get all the times in between the min and max, and then fill downwards (assuming it's sorted by name then by time elapsed) to produce the following output
Name | Time Elapsed | Amount |
---|---|---|
A | 2 | $2 |
A | 3 | $2 |
A | 4 | $2 |
A | 5 | $1 |
A | 6 | $1 |
A | 7 | $6 |
B | 3 | $3 |
B | 4 | $3 |
B | 5 | $5 |
I tried the following:
df = df(df.set_index('Time Elapsed')
.groupby('Name')['Amount']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max() 1))
.ffill().fillna(0)).reset_index)
and got a ValueError: cannot reindex on an axis with duplicate labels.
Feel free to offer a solution that does not use any of my code.
CodePudding user response:
try this:
def func(g: pd.DataFrame):
tmp = g.set_index('Time Elapsed')
res = tmp.reindex(
np.arange(tmp.index.min(), tmp.index.max() 1),
method='ffill')
return res
grouped = df.groupby('Name', as_index=False)
result = grouped.apply(func).reset_index().reindex(columns=df.columns)
print(result)
>>>
Name Time Elapsed Amount
0 A 2 $2
1 A 3 $2
2 A 4 $2
3 A 5 $1
4 A 6 $1
5 A 7 $6
6 B 3 $3
7 B 4 $3
8 B 5 $5
CodePudding user response:
One option is with complete from pyjanitor, to expose the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# build a dictionary of all the possible times
# the key of the dictionary should be the column to be expanded
times = {"Time Elapsed" : lambda df: range(df.min(), df.max() 1)}
df.complete(times, by = 'Name').ffill()
Name Time Elapsed Amount
0 A 2 $2
1 A 3 $2
2 A 4 $2
3 A 5 $1
4 A 6 $1
5 A 7 $6
6 B 3 $3
7 B 4 $3
8 B 5 $5