How do I compare 2 data frames and remove the rows that have similar values?
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
combinedf = pd.concat([df, df1], axis=1)
df contains the column 'trace_id', and df1 contains the columns 'index' and 'name'. Notice that trace_id and index are very similar in values, 'TRACE_PERSON_0000000003' and 'PERSON_0000000003' respectively. How do I remove the rows between that have similar trace_id and index values.
Example: trace_id = 'TRACE_PERSON_0000000003' and index = 'PERSON_0000000003', both its trace_id, index and name will be removed. 'PERSON_0000000000' are not found in the trace_id column, so 'PERSON_0000000000' and 'Amy Berger' will be retained in the data frame.
CodePudding user response:
Hard to be sure without example data, but you can:
- delete
TRACE_
from thetrace_id
column ofdf
- merge on the trimmed
trace_id
andindex
, passingindicator=True
to get a merge indicator column named_merge
- exclude rows where
_merge == 'both'
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
# Delete "TRACE_" from `trace_id` column
df['trace_id_trimmed'] = df['trace_id'].str.replace('TRACE_', '')
# Outer merge with indicator column
merged = df.merge(df1,
how='outer',
left_on='trace_id_trimmed',
right_on='index',
indicator=True)
# Exclude rows where merge key was found in both DataFrames
merged = merged[merged['_merge'] != 'both'].copy()