I have a pandas DataFrame with three columns, like this:
Time | Code | Id |
---|---|---|
10:10:00 | Rx | 11 |
10:10:01 | Tx | 11 |
10:10:02 | Rx | 12 |
10:10:04 | Tx | 12 |
10:10:06 | Rx | 13 |
10:10:07 | Tx | 13 |
10:10:08 | Rx | 11 |
10:10:10 | Rx | 11 |
For each Rx code, I want to check whether a Tx code follows immediately after it and whether the Id is the same for the Rx and the Tx.
I want to get the rows of any duplicated Rx, if there are any.
In my example, I want to drop the 10:10:10 Rx because it is a duplicate.
I managed to do it with a for loop, but I shouldn't iterate over a DataFrame with a for loop:
old_cell = None
for index, row in pdo_df.iterrows():
    # Compare each row with the previous one; the first row has no predecessor
    if old_cell is not None and row['Function_code'] == old_cell['Function_code']:
        print("----------------")
        print("Error :")
        print(old_cell)
        print(row)
        print("----------------")
    old_cell = row
CodePudding user response:
The shift method lets you look at the value in the previous row. This code detects all the duplicates:
df[
(df["Code"] == df["Code"].shift()) &
(df["Id"] == df["Id"].shift())
]
Following the same logic, negating that condition gives you the DataFrame without those duplicates:
df[
~((df["Code"] == df["Code"].shift()) &
(df["Id"] == df["Id"].shift()) )
]
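To try this out end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same shift comparison. The names is_dup, duplicates and cleaned are only illustrative, and the column names are taken from the table in the question rather than from the asker's real DataFrame.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Time": ["10:10:00", "10:10:01", "10:10:02", "10:10:04",
             "10:10:06", "10:10:07", "10:10:08", "10:10:10"],
    "Code": ["Rx", "Tx", "Rx", "Tx", "Rx", "Tx", "Rx", "Rx"],
    "Id": [11, 11, 12, 12, 13, 13, 11, 11],
})

# True where a row has the same Code and Id as the row directly above it.
# shift() puts NaN in the first row, so that row never compares equal.
is_dup = (df["Code"] == df["Code"].shift()) & (df["Id"] == df["Id"].shift())

duplicates = df[is_dup]   # rows that repeat the previous row
cleaned = df[~is_dup]     # everything else

print(duplicates)
print(cleaned)
```

On this sample only the 10:10:10 Rx row is flagged. Note that the comparison only looks at the immediately preceding row, so it relies on the rows already being in time order.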
CodePudding user response:
If this is based on the Code column, we can do it in two steps: first create a cumulative count over the Code column, then use a simple sum over each pair of rows to check whether the counts match.
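# Running count per code: the n-th Rx and the n-th Tx both get n - 1
s = df.groupby('Code').cumcount()
# Sum the counts over consecutive row pairs; an odd sum flags a mismatched pair
s1 = (s.groupby(s.index // 2).transform('sum') % 2 > 0)
# Keep the unflagged rows, plus the first flagged row for each code
df1 = pd.concat([df[~s1], df[s1].drop_duplicates(subset='Code', keep='first')])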
print(df1)
Time Code Id
0 10:10:00 Rx 11
1 10:10:01 Tx 11
2 10:10:02 Rx 12
3 10:10:04 Tx 12
4 10:10:06 Rx 13
5 10:10:07 Tx 13
6 10:10:08 Rx 11
Digging into the steps:
print(s)
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 4
print(s1)
0 False
1 False
2 False
3 False
4 False
5 False
6 True # <-- non-matching duplicate one
7 True # <-- non-matching duplicate two
dtype: bool
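The steps above can be reproduced with the following sketch, which rebuilds the sample data from the question. It assumes the DataFrame has a default 0..n-1 integer index, because s.index // 2 pairs rows by position; the intermediate name pair_sum is only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Time": ["10:10:00", "10:10:01", "10:10:02", "10:10:04",
             "10:10:06", "10:10:07", "10:10:08", "10:10:10"],
    "Code": ["Rx", "Tx", "Rx", "Tx", "Rx", "Tx", "Rx", "Rx"],
    "Id": [11, 11, 12, 12, 13, 13, 11, 11],
})

# Running count per code: the n-th Rx and the n-th Tx both get n - 1
s = df.groupby("Code").cumcount()

# Group rows into positional pairs (0-1, 2-3, ...) and sum the counts.
# This assumes a default RangeIndex.
pair_sum = s.groupby(s.index // 2).transform("sum")

# A matched Rx/Tx pair has equal counts, so its sum is even.
# An odd sum flags a mismatch; an even sum alone does not prove a match.
s1 = pair_sum % 2 > 0

# Keep the unflagged rows, plus the first flagged row for each code
df1 = pd.concat([df[~s1], df[s1].drop_duplicates(subset="Code", keep="first")])
print(df1)
```

If the real DataFrame has been filtered or re-sorted, calling df.reset_index(drop=True) first should restore the index this pairing depends on. Unlike the shift approach, this one does not compare the Id column at all.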