Home > Software design >  Compare 2 Data frames for partial similarities
Compare 2 Data frames for partial similarities

Time:02-24

How do I compare 2 data frames and remove the rows that have similar values?

df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
combinedf = pd.concat([df, df1], axis=1)

enter image description here

df contains the column 'trace_id', and df1 contains the columns 'index' and 'name'. Notice that trace_id and index are very similar in values, 'TRACE_PERSON_0000000003' and 'PERSON_0000000003' respectively. How do I remove the rows between that have similar trace_id and index values.

Example: trace_id = 'TRACE_PERSON_0000000003' and index = 'PERSON_0000000003', both its trace_id, index and name will be removed. 'PERSON_0000000000' are not found in the trace_id column, so 'PERSON_0000000000' and 'Amy Berger' will be retained in the data frame.

CodePudding user response:

Hard to be sure without example data, but you can:

  1. delete TRACE_ from the trace_id column of df
  2. merge on the trimmed trace_id and index, passing indicator=True to get a merge indicator column named _merge
  3. exclude rows where _merge == 'both'
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')

# Delete "TRACE_" from `trace_id` column
df['trace_id_trimmed'] = df['trace_id'].str.replace('TRACE_', '')

# Outer merge with indicator column
merged = df.merge(df1, 
                  how='outer', 
                  left_on='trace_id_trimmed', 
                  right_on='index', 
                  indicator=True)

# Exclude rows where merge key was found in both DataFrames
merged = merged[merged['_merge'] != 'both'].copy()
  • Related