Tricky duplicate rows based on condition and add a counter in Python-CodePudding

I have a dataframe where if a certain condition is met, I'd like to essentially create a duplicate of that row. Row should be duplicated IF 'Date' = Q4.22 or > AND type = 'live' Also, for every duplicate created the 'unit' count should be updated to reflect this (grouped by id and Date) Once the duplicate is established, the unit count should reflect the new count based on the same id and Date.

Data

   id   Date    set     type    unit    energy
   bb   Q4.22   l       live    l01     20
   bb   Q4.22   l       live    l02     20
   ba   Q3.22   l       non     l01     20
   aa   Q4.22   l       non     l01     20
   aa   Q4.22   l       live    l01     20
   cc   Q3.22   l       non     l01     20
   aa   Q4.22   l       live    l02     20

Desired

   id   Date    set     type    unit    energy
   bb   Q4.22   l       live    l01     20
   bb   Q4.22   l       live    l02     20
   bb   Q4.22   l       live    l03     20
   bb   Q4.22   l       live    l04     20
   ba   Q3.22   l       non     l01     20
   aa   Q4.22   l       non     l01     20
   aa   Q4.22   l       live    l01     20
   aa   Q4.22   l       live    l02     20
   cc   Q3.22   l       non     l01     20
   aa   Q4.22   l       live    l03     20
   aa   Q4.22   l       live    l04     20

Doing

pd.concat([df, df.loc[(df['Date'] == > 'Q4.22') & (df['live'] == 'live')]])

However, I still need to add counter to the new duplicates that are created. Any suggestion is appreciated.

CodePudding user response：

Try:

Convert your date column to timestamps
concat your original data with the filtered data
groupby to get the cumcount of "id" and "Date" and set the "unit" accordingly

df["Date"] = pd.to_datetime(df["Date"].str.replace(r"(Q\d).(\d )", r"\2-\1",regex=True))

output = pd.concat([df, df[df["Date"].ge(pd.Timestamp("2022-10-01"))&df["type"].eq("live")]], ignore_index=True)
output["unit"] = output["set"] output.groupby(["id", "Date"]).cumcount().add(1).astype(str).str.zfill(2)
output = output.sort_values("id", ignore_index=True)

#convert Date back to original format if needed
output["Date"] = output["Date"].dt.to_period("Q").astype(str).str.replace(r"\d\d(\d )(Q\d)",r"\2.\1",regex=True)

>>> output
    id   Date set  type unit  energy
0   aa  Q4.22   l   non  l01      20
1   aa  Q4.22   l  live  l02      20
2   aa  Q4.22   l  live  l03      20
3   aa  Q4.22   l  live  l04      20
4   aa  Q4.22   l  live  l05      20
5   ba  Q3.22   l   non  l01      20
6   bb  Q4.22   l  live  l01      20
7   bb  Q4.22   l  live  l02      20
8   bb  Q4.22   l  live  l03      20
9   bb  Q4.22   l  live  l04      20
10  cc  Q3.22   l   non  l01      20

CodePudding user response：

First, as noted in the comments, we need to convert some of the df columns into more convenient types:

int for unit (stripping any chars),
pd.Period for the Date.

df2 = df.assign(
    unit=df['unit'].str.extract(r'(\d )').astype(int),
    period=df['Date'].str.replace(r'^(Q\d)\D*(\d )$', r'\2\1', regex=True).apply(pd.Period)
)

>>> df2
   id   Date set  type  unit  energy  period
0  bb  Q4.22   l  live     1      20  2022Q4
1  bb  Q4.22   l  live     2      20  2022Q4
2  ba  Q3.22   l   non     1      20  2022Q3
3  aa  Q4.22   l   non     1      20  2022Q4
4  aa  Q4.22   l  live     1      20  2022Q4
5  cc  Q3.22   l   non     1      20  2022Q3
6  aa  Q4.22   l  live     2      20  2022Q4

>>> df2.dtypes
id               object
Date             object
set              object
type             object
unit              int64
energy            int64
period    period[Q-DEC]
dtype: object

With this done, now we can proceed with the logic of the question itself.

ix_repeat = (df2['period'] >= pd.Period('2022-Q4')) & (df2['type'] == 'live')
r = df2.loc[ix_repeat]
r.assign(unit=r['unit']   r.groupby(['id', 'period'])['unit'].transform(max))

>>> r
   id   Date set  type  unit  energy  period
0  bb  Q4.22   l  live     3      20  2022Q4
1  bb  Q4.22   l  live     4      20  2022Q4
4  aa  Q4.22   l  live     3      20  2022Q4
6  aa  Q4.22   l  live     4      20  2022Q4

# finally
df2 = pd.concat([df2, r])

Optional: bringing back unit into its weird string version:

df2 = df2.assign(unit=df2['set']   df2['unit'].astype(str).str.zfill(2))

>>> df2
   id   Date set  type unit  energy  period
0  bb  Q4.22   l  live  l01      20  2022Q4
1  bb  Q4.22   l  live  l02      20  2022Q4
2  ba  Q3.22   l   non  l01      20  2022Q3
3  aa  Q4.22   l   non  l01      20  2022Q4
4  aa  Q4.22   l  live  l01      20  2022Q4
5  cc  Q3.22   l   non  l01      20  2022Q3
6  aa  Q4.22   l  live  l02      20  2022Q4
0  bb  Q4.22   l  live  l03      20  2022Q4
1  bb  Q4.22   l  live  l04      20  2022Q4
4  aa  Q4.22   l  live  l03      20  2022Q4
6  aa  Q4.22   l  live  l04      20  2022Q4