Home > Enterprise >  Python Pandas Dataframe: Duplicating data in one row to multiple rows with the same value and mergin
Python Pandas Dataframe: Duplicating data in one row to multiple rows with the same value and mergin

Time:08-03

I've been trying to create a script to clean up some data but I'm having trouble and have been unable to find the information. I could also be searching for the wrong thing.

I've got a CSV with data that contains a large number of columns and rows. I'm trying to merge rows together. There should only be 1 line for each 'Repeat_Instance' and the empty 'Repeat_Instance' row needs to be in each other row. I've mostly got it working by using a pandas groupby command except for the first row of an ID needs to be copied to each subsequent row with the same ID.

Here is a short example of the dataframe/table:

Record_ID Repeat_Name Repeat_Instance DataInFirstRepeatName Data_In_check_current Data_In_Assessment
1 A
1 check_current 1 B
1 check_current 2 C
1 assessment 1 D
2 Z
2 check_current 1 Y
2 assessment 1 X
2 assessment 2 V

In the end, it should look like this:

Record_ID Repeat_Name Repeat_Instance DataInFirstRepeatName Data_In_check_current Data_In_Assessment
1 check_current 1 A B D
1 check_current 2 A C
2 check_current 1 Z Y X
2 assessment 2 Z V

For actual code, I have this:

import pandas as pd
import numpy as np

df = pd.read_csv('Test.csv')

df_joined = df.groupby(['Record_ID', 'Repeat_Instance'], sort=False).max().reset_index()

df_joined.to_csv('complete.csv', index=False)

I've been unsuccessful with copying that first row of a Record_ID to each row. I thought using pandas concat may be the solution but I've had no luck. https://pandas.pydata.org/docs/reference/api/pandas.concat.html

Any help or pointers would be greatly appreciated.

CodePudding user response:

You can propagate the DataInFirstRepeatName information in the first row to the others using the forward fill function ffill. When this is done, simply drop the first row in each group (can be done using e.g. dropna on a suitable column).

df['DataInFirstRepeatName'] = df.groupby('Record_ID')['DataInFirstRepeatName'].ffill()
df = df.dropna(subset=['Repeat_Name'])

Then apply the other row merging logic (here with first):

df.groupby(['Record_ID', 'Repeat_Instance']).first().reset_index()

Result:

   Record_ID  Repeat_Instance    Repeat_Name DataInFirstRepeatName Data_In_check_current Data_In_Assessment
0          1              1.0  check_current                     A                     B                  D
1          1              2.0  check_current                     A                     C               None
2          2              1.0  check_current                     Z                     Y                  X
3          2              2.0     assessment                     Z                  None                  V
  • Related