I've been trying to create a script to clean up some data but I'm having trouble and have been unable to find the information. I could also be searching for the wrong thing.
I've got a CSV with data that contains a large number of columns and rows. I'm trying to merge rows together. There should only be 1 line for each 'Repeat_Instance' and the empty 'Repeat_Instance' row needs to be in each other row. I've mostly got it working by using a pandas groupby
command except for the first row of an ID needs to be copied to each subsequent row with the same ID.
Here is a short example of the dataframe/table:
Record_ID | Repeat_Name | Repeat_Instance | DataInFirstRepeatName | Data_In_check_current | Data_In_Assessment |
---|---|---|---|---|---|
1 | A | ||||
1 | check_current | 1 | B | ||
1 | check_current | 2 | C | ||
1 | assessment | 1 | D | ||
2 | Z | ||||
2 | check_current | 1 | Y | ||
2 | assessment | 1 | X | ||
2 | assessment | 2 | V |
In the end, it should look like this:
Record_ID | Repeat_Name | Repeat_Instance | DataInFirstRepeatName | Data_In_check_current | Data_In_Assessment |
---|---|---|---|---|---|
1 | check_current | 1 | A | B | D |
1 | check_current | 2 | A | C | |
2 | check_current | 1 | Z | Y | X |
2 | assessment | 2 | Z | V |
For actual code, I have this:
import pandas as pd
import numpy as np
df = pd.read_csv('Test.csv')
df_joined = df.groupby(['Record_ID', 'Repeat_Instance'], sort=False).max().reset_index()
df_joined.to_csv('complete.csv', index=False)
I've been unsuccessful with copying that first row of a Record_ID to each row. I thought using pandas concat
may be the solution but I've had no luck. https://pandas.pydata.org/docs/reference/api/pandas.concat.html
Any help or pointers would be greatly appreciated.
CodePudding user response:
You can propagate the DataInFirstRepeatName
information in the first row to the others using the forward fill function ffill
. When this is done, simply drop the first row in each group (can be done using e.g. dropna
on a suitable column).
df['DataInFirstRepeatName'] = df.groupby('Record_ID')['DataInFirstRepeatName'].ffill()
df = df.dropna(subset=['Repeat_Name'])
Then apply the other row merging logic (here with first
):
df.groupby(['Record_ID', 'Repeat_Instance']).first().reset_index()
Result:
Record_ID Repeat_Instance Repeat_Name DataInFirstRepeatName Data_In_check_current Data_In_Assessment
0 1 1.0 check_current A B D
1 1 2.0 check_current A C None
2 2 1.0 check_current Z Y X
3 2 2.0 assessment Z None V