Create duplicate row in Pandas dataframe based on condition, and change values for a specific column

Time:11-24

I have a Pandas dataframe with multiple columns (simplified to the following format for this example):

Item        Interval    Specs
Chocolate      0        0.001
Chocolate      3        0.002
Chocolate      5        0.007
Milk           0        0.004
Milk           2        0.008
....

For each Item, I would like to duplicate the row where Interval is 0 once for every other Interval value that the Item has. Each duplicate keeps the Specs value of the Interval 0 row, and only its Interval value changes. For instance, Chocolate has three rows, with Interval 0, 3 and 5, so its Interval 0 row should be duplicated twice, with Interval set to 3 and 5. Similarly, for Milk I want one duplicate row with Interval changed to 2. (The number of duplicate rows therefore equals the number of unique non-zero Interval values for each Item.)

The expected dataframe is:

Item        Interval    Specs
Chocolate      0        0.001
Chocolate      3        0.002
Chocolate      5        0.007
Chocolate      3        0.001   --Duplicate row added and Interval value changed
Chocolate      5        0.001   --Duplicate row added and Interval value changed
Milk           0        0.004 
Milk           2        0.008   
Milk           2        0.004   --Duplicate row added and Interval value changed

A similar question (Duplicate rows based on value with condition) on SO focuses on creating duplicate rows based on condition, but not on changing the values in the dataframe corresponding to the other rows. Any suggestions on how to accomplish this would be much appreciated.

CodePudding user response:

One way of solving this is to create a second dataframe with all rows that do not have Interval=0 (the .copy() avoids a SettingWithCopyWarning when the column is assigned in the next step):

df2 = df[df.Interval != 0].copy()

then map the values of column Specs from the entries with Interval==0 onto column Specs in the new dataframe:

df2.loc[:, 'Specs'] = df2['Item'].map(df[df.Interval == 0].set_index('Item')['Specs'])

and concatenate the 2 dataframes in the end

df = pd.concat([df, df2], axis=0)

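This gives the desired rows. The duplicates are appended after the original rows and keep their original index labels, so sort by Item or pass ignore_index=True to pd.concat if the order or index matters.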
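
The steps above can be put together into one runnable sketch. The sample data is copied from the question; the names `dup`, `baseline` and `result` are only illustrative, and the sketch assumes each Item has exactly one row with Interval 0.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame(
    {
        "Item": ["Chocolate", "Chocolate", "Chocolate", "Milk", "Milk"],
        "Interval": [0, 3, 5, 0, 2],
        "Specs": [0.001, 0.002, 0.007, 0.004, 0.008],
    }
)

# Rows to duplicate: every row with a non-zero Interval.
# .copy() makes this an independent frame, so the assignment below is safe.
dup = df[df["Interval"] != 0].copy()

# Baseline Specs per Item, read from the Interval == 0 rows
# (assumption: one such row per Item, so the index is unique)
baseline = df[df["Interval"] == 0].set_index("Item")["Specs"]

# Overwrite Specs in the duplicates with the baseline value of their Item
dup["Specs"] = dup["Item"].map(baseline)

# Append the duplicates to the original rows with a fresh 0..n-1 index
result = pd.concat([df, dup], ignore_index=True)
print(result)
```

The duplicates end up after the original rows here; sorting by Item afterwards groups them as in the expected dataframe.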

CodePudding user response:

We can do this by first creating 2 dataframes:

  1. DF with rows having interval as 0
  2. DF with rows having interval as not 0

Then we build a dictionary that maps each Item to its Specs value in the interval-0 dataframe, and use it to overwrite the Specs values in the other dataframe. Finally, the updated dataframe is concatenated with the original dataframe. The code is given below:

import pandas as pd

rows = [['Chocolate', 0, 0.001],
    ['Chocolate', 3, 0.002],
    ['Chocolate', 5, 0.007],
    ['Milk', 0, 0.004],
    ['Milk', 2, 0.008]]

df = pd.DataFrame(rows, columns=['Item', 'Interval', 'Specs']) # original dataframe

df1 = df[df['Interval'].ne(0)].copy() # dataframe with rows having interval other than 0
df2 = df[df['Interval'].eq(0)].copy() # dataframe with rows having interval 0

di = pd.Series(df2['Specs'].values, index=df2['Item']).to_dict()

for key, val in di.items():
    df1.loc[df1.Item == key, 'Specs'] = val

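# a stable sort keeps the original rows ahead of their duplicates within each Item
res = pd.concat([df, df1], ignore_index=True).sort_values('Item', kind='mergesort')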
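
As an alternative that neither answer uses, the same duplicates could also be built with a join instead of a dictionary loop. The following is a minimal sketch on the question's sample data; `baseline`, `targets`, `dup` and `result` are illustrative names, not part of the original code.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame(
    [
        ["Chocolate", 0, 0.001],
        ["Chocolate", 3, 0.002],
        ["Chocolate", 5, 0.007],
        ["Milk", 0, 0.004],
        ["Milk", 2, 0.008],
    ],
    columns=["Item", "Interval", "Specs"],
)

# Specs of the Interval == 0 row for each Item
baseline = df.loc[df["Interval"] == 0, ["Item", "Specs"]]

# Item/Interval pairs that need a duplicate row
targets = df.loc[df["Interval"] != 0, ["Item", "Interval"]]

# Inner join: each target row picks up the baseline Specs of its Item.
# Items without an Interval == 0 row produce no duplicates.
dup = targets.merge(baseline, on="Item", how="inner")

# Append the duplicates, then group rows by Item with a stable sort so the
# original rows stay ahead of their duplicates
result = (
    pd.concat([df, dup], ignore_index=True)
    .sort_values("Item", kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```

If an Item had more than one Interval == 0 row, the join would emit one duplicate per baseline row, so it may be worth checking that case before relying on this variant.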