I have a dataframe
df = pd.DataFrame([["X","day_2"],["Y","day_4"],["Z","day_3"]],columns=["id","day"])
id day
X day_2
Y day_4
Z day_3
I want to increase the rows for each id till I reach day_5 starting from the next day mentioned in the day column. For example for X id day_2 is there get 3 rows starting from day_3 to day_5, for Y id get only 1 row day_5 for Z get 2 rows day_4 and day_5 as day_3 is there in day column.
Expected Output:
df = pd.DataFrame([["X","day_3"],["X","day_4"],["X","day_5"],["Y","day_5"],["Z","day_4"],["Z","day_5"]],columns=["id","day"])
id day
X day_3
X day_4
X day_5
Y day_5
Z day_4
Z day_5
How to do it?
CodePudding user response:
n = 5
out = (df.set_index('id')['day']
.apply(lambda x: [f'day_{i 1}' for i in range(int(x.split('_')[1]), n)])
.explode().reset_index())
out
id day
0 X day_3
1 X day_4
2 X day_5
3 Y day_5
4 Z day_4
5 Z day_5
you can change n
CodePudding user response:
You can extract
the digit, compute the number of missing values to go to 5, repeat
the rows and increment the values with groupby.cumcount
:
s = df['day'].str.extract(r'_(\d )', expand=False).astype(int)
out = (df
.assign(n=s)
.loc[df.index.repeat(s.rsub(5))]
.assign(day=lambda d: 'day_' d.pop('n')
.add(d.groupby('id').cumcount().add(1))
.astype(str))
)
NB. If you want a clean index, add reset_index(drop=True)
.
Output:
id day
0 X day_3
0 X day_4
0 X day_5
1 Y day_5
2 Z day_4
2 Z day_5
Variant if you also want to include the initial row.:
s = df['day'].str.extract(r'_(\d )', expand=False).astype(int)
out = (df
.assign(n=s)
.loc[df.index.repeat(s.rsub(5).add(1))]
.assign(day=lambda d: 'day_' d.pop('n')
.add(d.groupby('id').cumcount())
.astype(str))
)
Output:
id day
0 X day_2
0 X day_3
0 X day_4
0 X day_5
1 Y day_4
1 Y day_5
2 Z day_3
2 Z day_4
2 Z day_5
CodePudding user response:
a direct way would be
- get how many days are needed for each row
- then slice from an all-days Series that many days as lists
- explode to spread out the result to rows
N = 5
days_needed = df.day.str.split("_").str[-1].astype(int).rsub(N)
days_all = pd.Series(range(N)).add(1).astype(str).radd("day_")
(df.assign(day=days_needed.apply(lambda n: days_all.iloc[-n:].tolist()))
.explode("day", ignore_index=True))
to get
id day
0 X day_3
1 X day_4
2 X day_5
3 Y day_5
4 Z day_4
5 Z day_5
the intermediate values are
In [132]: days_all = pd.Series(range(N)).add(1).astype(str).radd("day_")
In [133]: days_all
Out[133]:
0 day_1
1 day_2
2 day_3
3 day_4
4 day_5
dtype: object
In [134]: days_needed = df.day.str.split("_").str[-1].astype(int).rsub(N)
In [135]: days_needed
Out[135]:
0 3
1 1
2 2
Name: day, dtype: int32
In [136]: days_needed.apply(lambda n: days_all.iloc[-n:].tolist())
Out[136]:
0 [day_3, day_4, day_5]
1 [day_5]
2 [day_4, day_5]
Name: day, dtype: object
CodePudding user response:
well here is my solution to this, define the old df
df = pd.DataFrame([["X","day_2"],["Y","day_4"],["Z","day_3"]],columns=["id","day"]
)
df = df.set_index('id')
create a new one to add the rows to, better than concatinating two pieces of dataframes with insertion in between
new_df = pd.DataFrame(columns=["id","day"])
a method to add the missing rows
def addRows( dayNum, df, _id):
while dayNum <= 5:
df.loc[df.shape[0]] = [_id, 'day_' str(dayNum)]
dayNum =1
actual work.
for i, r in df.iterrows():
dayNum = int (r['day'].split('_')[-1] )
if dayNum <= 5:
addRows(dayNum, new_df, i )
new_df.set_index('id')