I have a dataframe where, if a certain condition is met, I'd like to create a duplicate of that row. A row should be duplicated IF 'Date' is Q4.22 or later AND type = 'live'. For every duplicate created, the 'unit' count should be updated so that, once the duplicate exists, it reflects the new count within the same id and Date group.
Data
id Date set type unit energy
bb Q4.22 l live l01 20
bb Q4.22 l live l02 20
ba Q3.22 l non l01 20
aa Q4.22 l non l01 20
aa Q4.22 l live l01 20
cc Q3.22 l non l01 20
aa Q4.22 l live l02 20
Desired
id Date set type unit energy
bb Q4.22 l live l01 20
bb Q4.22 l live l02 20
bb Q4.22 l live l03 20
bb Q4.22 l live l04 20
ba Q3.22 l non l01 20
aa Q4.22 l non l01 20
aa Q4.22 l live l01 20
aa Q4.22 l live l02 20
cc Q3.22 l non l01 20
aa Q4.22 l live l03 20
aa Q4.22 l live l04 20
Doing
pd.concat([df, df.loc[(df['Date'] >= 'Q4.22') & (df['type'] == 'live')]])
However, I still need to add a counter to the new duplicates that are created. Any suggestion is appreciated.
CodePudding user response:
Try:
- Convert your date column to timestamps
- concat your original data with the filtered data
- groupby to get the cumcount of "id" and "Date" and set the "unit" accordingly
df["Date"] = pd.to_datetime(df["Date"].str.replace(r"(Q\d).(\d )", r"\2-\1",regex=True))
output = pd.concat([df, df[df["Date"].ge(pd.Timestamp("2022-10-01"))&df["type"].eq("live")]], ignore_index=True)
output["unit"] = output["set"] output.groupby(["id", "Date"]).cumcount().add(1).astype(str).str.zfill(2)
output = output.sort_values("id", ignore_index=True)
#convert Date back to original format if needed
output["Date"] = output["Date"].dt.to_period("Q").astype(str).str.replace(r"\d\d(\d )(Q\d)",r"\2.\1",regex=True)
>>> output
id Date set type unit energy
0 aa Q4.22 l non l01 20
1 aa Q4.22 l live l02 20
2 aa Q4.22 l live l03 20
3 aa Q4.22 l live l04 20
4 aa Q4.22 l live l05 20
5 ba Q3.22 l non l01 20
6 bb Q4.22 l live l01 20
7 bb Q4.22 l live l02 20
8 bb Q4.22 l live l03 20
9 bb Q4.22 l live l04 20
10 cc Q3.22 l non l01 20
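Note that because the cumcount groups only on "id" and "Date", the "non" row for id aa takes l01 and the live rows get l02-l05, which differs slightly from the desired output in the question. If live and non rows should be numbered independently, one possible variant (an untested tweak, not part of the code above) is to include "type" in the groupby:
output["unit"] = output["set"] + output.groupby(["id", "Date", "type"]).cumcount().add(1).astype(str).str.zfill(2)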
CodePudding user response:
First, as noted in the comments, we need to convert some of the df columns into more convenient types: int for unit (stripping any chars), pd.Period for the Date.
df2 = df.assign(
    unit=df['unit'].str.extract(r'(\d+)').astype(int),
    period=df['Date'].str.replace(r'^(Q\d)\D*(\d+)$', r'\2\1', regex=True).apply(pd.Period),
)
>>> df2
id Date set type unit energy period
0 bb Q4.22 l live 1 20 2022Q4
1 bb Q4.22 l live 2 20 2022Q4
2 ba Q3.22 l non 1 20 2022Q3
3 aa Q4.22 l non 1 20 2022Q4
4 aa Q4.22 l live 1 20 2022Q4
5 cc Q3.22 l non 1 20 2022Q3
6 aa Q4.22 l live 2 20 2022Q4
>>> df2.dtypes
id object
Date object
set object
type object
unit int64
energy int64
period period[Q-DEC]
dtype: object
With this done, we can proceed with the logic of the question itself.
ix_repeat = (df2['period'] >= pd.Period('2022-Q4')) & (df2['type'] == 'live')
r = df2.loc[ix_repeat]
r = r.assign(unit=r['unit'] + r.groupby(['id', 'period'])['unit'].transform(max))
>>> r
id Date set type unit energy period
0 bb Q4.22 l live 3 20 2022Q4
1 bb Q4.22 l live 4 20 2022Q4
4 aa Q4.22 l live 3 20 2022Q4
6 aa Q4.22 l live 4 20 2022Q4
# finally
df2 = pd.concat([df2, r])
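As a quick sanity check, one could confirm that each live id/period group now has twice as many rows as before (4 instead of 2 for both aa and bb):
print(df2[df2['type'] == 'live'].groupby(['id', 'period']).size())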
Optional: bringing unit back into its weird string form:
df2 = df2.assign(unit=df2['set'] + df2['unit'].astype(str).str.zfill(2))
>>> df2
id Date set type unit energy period
0 bb Q4.22 l live l01 20 2022Q4
1 bb Q4.22 l live l02 20 2022Q4
2 ba Q3.22 l non l01 20 2022Q3
3 aa Q4.22 l non l01 20 2022Q4
4 aa Q4.22 l live l01 20 2022Q4
5 cc Q3.22 l non l01 20 2022Q3
6 aa Q4.22 l live l02 20 2022Q4
0 bb Q4.22 l live l03 20 2022Q4
1 bb Q4.22 l live l04 20 2022Q4
4 aa Q4.22 l live l03 20 2022Q4
6 aa Q4.22 l live l04 20 2022Q4
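If a clean index or a specific row order is needed at the end, a possible follow-up (the sort keys here are just one reasonable choice) is to drop the helper column and sort:
df2 = df2.drop(columns='period').sort_values(['id', 'unit']).reset_index(drop=True)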