Increase the number of rows till we reach some condition in pandas-CodePudding

I have a dataframe

df = pd.DataFrame([["X","day_2"],["Y","day_4"],["Z","day_3"]],columns=["id","day"])

id  day
X   day_2
Y   day_4
Z   day_3

I want to increase the rows for each id till I reach day_5 starting from the next day mentioned in the day column. For example for X id day_2 is there get 3 rows starting from day_3 to day_5, for Y id get only 1 row day_5 for Z get 2 rows day_4 and day_5 as day_3 is there in day column.

Expected Output:

df = pd.DataFrame([["X","day_3"],["X","day_4"],["X","day_5"],["Y","day_5"],["Z","day_4"],["Z","day_5"]],columns=["id","day"])

id  day
X   day_3
X   day_4
X   day_5
Y   day_5
Z   day_4
Z   day_5

How to do it?

CodePudding user response：

n = 5
out = (df.set_index('id')['day']
       .apply(lambda x: [f'day_{i 1}' for i in range(int(x.split('_')[1]), n)])
       .explode().reset_index())

out

    id  day
0   X   day_3
1   X   day_4
2   X   day_5
3   Y   day_5
4   Z   day_4
5   Z   day_5

you can change n

CodePudding user response：

You can extract the digit, compute the number of missing values to go to 5, repeat the rows and increment the values with groupby.cumcount:

s = df['day'].str.extract(r'_(\d )', expand=False).astype(int)

out = (df
   .assign(n=s)
   .loc[df.index.repeat(s.rsub(5))]
   .assign(day=lambda d: 'day_'   d.pop('n')
                                   .add(d.groupby('id').cumcount().add(1))
                                   .astype(str))
 )

NB. If you want a clean index, add reset_index(drop=True).

Output:

  id    day
0  X  day_3
0  X  day_4
0  X  day_5
1  Y  day_5
2  Z  day_4
2  Z  day_5

Variant if you also want to include the initial row.:

s = df['day'].str.extract(r'_(\d )', expand=False).astype(int)

out = (df
   .assign(n=s)
   .loc[df.index.repeat(s.rsub(5).add(1))]
   .assign(day=lambda d: 'day_'   d.pop('n')
                                   .add(d.groupby('id').cumcount())
                                   .astype(str))
 )

Output:

  id    day
0  X  day_2
0  X  day_3
0  X  day_4
0  X  day_5
1  Y  day_4
1  Y  day_5
2  Z  day_3
2  Z  day_4
2  Z  day_5

CodePudding user response：

a direct way would be

get how many days are needed for each row
then slice from an all-days Series that many days as lists
explode to spread out the result to rows

N = 5

days_needed = df.day.str.split("_").str[-1].astype(int).rsub(N)
days_all    = pd.Series(range(N)).add(1).astype(str).radd("day_")

(df.assign(day=days_needed.apply(lambda n: days_all.iloc[-n:].tolist()))
   .explode("day", ignore_index=True))

to get

  id    day
0  X  day_3
1  X  day_4
2  X  day_5
3  Y  day_5
4  Z  day_4
5  Z  day_5

the intermediate values are

In [132]: days_all = pd.Series(range(N)).add(1).astype(str).radd("day_")

In [133]: days_all
Out[133]:
0    day_1
1    day_2
2    day_3
3    day_4
4    day_5
dtype: object

In [134]: days_needed = df.day.str.split("_").str[-1].astype(int).rsub(N)

In [135]: days_needed
Out[135]:
0    3
1    1
2    2
Name: day, dtype: int32

In [136]: days_needed.apply(lambda n: days_all.iloc[-n:].tolist())
Out[136]:
0    [day_3, day_4, day_5]
1                  [day_5]
2           [day_4, day_5]
Name: day, dtype: object

CodePudding user response：

well here is my solution to this, define the old df

df = pd.DataFrame([["X","day_2"],["Y","day_4"],["Z","day_3"]],columns=["id","day"]
                  )
df = df.set_index('id')

create a new one to add the rows to, better than concatinating two pieces of dataframes with insertion in between

new_df = pd.DataFrame(columns=["id","day"])

a method to add the missing rows

def addRows( dayNum, df, _id):
    while dayNum <= 5:
        df.loc[df.shape[0]] = [_id, 'day_' str(dayNum)]
        dayNum =1

actual work.

for i, r in df.iterrows():
    dayNum = int (r['day'].split('_')[-1] )
    if dayNum <= 5:
        addRows(dayNum, new_df, i )
new_df.set_index('id')