Maximal matching between two pandas dataframes


Suppose we have two dataframes.

original_data

sequence_number  fixed_criteria  fuzzy_criteria
1                a               10.42
2                b                1.27
3                b                6.32
4                a                5.91

jumbled_data

sequence_number  fixed_criteria  fuzzy_criteria
11               b                6.43
12               b                1.26
13               a                9.98
14               a               15.84
15               a                6.01

Then I want to perform a matching on this data so that I end up with a 1-1 correspondence between the two frames: rows may only be paired when their fixed_criteria values agree, the matching should be as large as possible, and, subject to that, the total difference in fuzzy_criteria should be as small as possible. I.e. the matching would be

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
1 10.42 a 9.98 13 0.44
2 1.27 b 1.26 12 0.01
3 6.32 b 6.43 11 0.11
4 5.91 a 6.01 15 0.1

EDIT:

To highlight the need for a maximal matching, consider the following example:

original_data

sequence_number  fixed_criteria  fuzzy_criteria
1                a               1
2                a               2

jumbled_data

sequence_number  fixed_criteria  fuzzy_criteria
13               a               1.9
14               a               2.9

Then listing every candidate pair, sorted by smallest fuzz_diff, would give:

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
1 1 a 1.9 13 0.9
2 2 a 2.9 14 0.9
1 1 a 2.9 14 1.9

Then removing duplicates in sequence_number_original (keeping the first, i.e. closest, pair for each) would leave:

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
1 1 a 1.9 13 0.9

Then removing duplicates in sequence_number_jumbled:

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1

Doing it the other way round ends the same way. First deduplicating sequence_number_jumbled ...

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
2 2 a 2.9 14 0.9

Then deduplicating sequence_number_original...

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1

However, this is not maximal, since the following matching pairs up both rows:

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
1 1 a 1.9 13 0.9
2 2 a 2.9 14 0.9

There are maximal matching algorithms in graph theory. I did actually just see this other post that is similar to mine.
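
For completeness, here is a rough pandas sketch of the greedy sort-and-drop-duplicates approach described above, using the two-row frames from this EDIT; it ends up with only one pair, which is why I am after a proper maximal matching:

import pandas as pd

original_data = pd.DataFrame({'sequence_number': [1, 2],
                              'fixed_criteria': ['a', 'a'],
                              'fuzzy_criteria': [1, 2]})
jumbled_data = pd.DataFrame({'sequence_number': [13, 14],
                             'fixed_criteria': ['a', 'a'],
                             'fuzzy_criteria': [1.9, 2.9]})

# Cross join on fixed_criteria and sort by the fuzzy difference ...
pairs = original_data.merge(jumbled_data, on='fixed_criteria',
                            suffixes=('_original', '_jumbled'))
pairs['fuzz_diff'] = (pairs['fuzzy_criteria_original'] -
                      pairs['fuzzy_criteria_jumbled']).abs()
pairs = pairs.sort_values('fuzz_diff')

# ... then greedily keep the first occurrence of each sequence number
greedy = (pairs.drop_duplicates('sequence_number_original')
               .drop_duplicates('sequence_number_jumbled'))
print(greedy)   # only the (2, 13) pair survives, so the matching is not maximal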

CodePudding user response:

If there are no duplicated values in either fuzzy_criteria column, you can create an auxiliary dataframe that pairs each value with the nearest value in the other fuzzy_criteria column.

import pandas as pd
from itertools import product

# All cross pairs of fuzzy values, sorted by absolute difference,
# then greedily keep each value's first (i.e. closest) partner
df = pd.DataFrame(
    sorted(product(original_data['fuzzy_criteria'], jumbled_data['fuzzy_criteria']),
           key=lambda t: abs(t[0] - t[1]))
)
df = df.drop_duplicates(0, keep='first')
df = df.drop_duplicates(1, keep='first')
print(df)

       0     1
0   1.27  1.26
1   5.91  6.01
2   6.32  6.43
4  10.42  9.98

Then merge each of the two dataframes with this auxiliary dataframe separately, and finally merge the two results on the auxiliary dataframe's columns.

df_ = pd.merge(
    (pd.merge(original_data, df, left_on='fuzzy_criteria', right_on=0)),
    (pd.merge(df, jumbled_data, left_on=1, right_on='fuzzy_criteria')),
    on=[0,1],
    suffixes=('_original', '_jumbled')
).drop([0, 1], axis=1)
df_['fuzz_diff'] = (df_['fuzzy_criteria_original'] - df_['fuzzy_criteria_jumbled']).abs()
print(df_)
   sequence_number_original fixed_criteria_original  fuzzy_criteria_original  \
0                         1                       a                    10.42
1                         2                       b                     1.27
2                         3                       b                     6.32
3                         4                       a                     5.91

   sequence_number_jumbled fixed_criteria_jumbled  fuzzy_criteria_jumbled  \
0                       13                      a                    9.98
1                       12                      b                    1.26
2                       11                      b                    6.43
3                       15                      a                    6.01

   fuzz_diff
0       0.44
1       0.01
2       0.11
3       0.10

CodePudding user response:

This is largely copied from @SpghttCd's answer to How to get the most pairs out of my pandas dataframe?

The idea is to use networkx to perform a maximal matching.

import pandas as pd
import networkx as nx

# Data input

original_data = pd.DataFrame({
    'sequence_number' : [1,2,3,4],
    'fixed_criteria' : ['a','b','b','a'],
    'fuzzy_criteria' : [10.42, 1.27, 6.32, 5.91]
})

jumbled_data = pd.DataFrame({
    'sequence_number' : [11,12,13,14,15],
    'fixed_criteria' : ['b','b','a','a','a'],
    'fuzzy_criteria' : [6.43, 1.26, 9.98, 15.84, 6.01]
})

# Merge along fixed criteria

joined_data = pd.merge(
    original_data,
    jumbled_data,
    how = 'inner',
    on = ['fixed_criteria'],
    suffixes=['_original','_jumbled']
)

# To use max weight matching, take the reciprocal of the difference as the
# edge weight (if the fuzzy values are not unique this will have to be changed)

joined_data['weight'] = (1/abs(
    joined_data['fuzzy_criteria_original'] -
    joined_data['fuzzy_criteria_jumbled']
))
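
# Added caveat (not in the original answer): if an original and a jumbled row
# ever share the exact same fuzzy_criteria value, the difference above is zero
# and the reciprocal weight becomes infinite. One possible hedge is to clip the
# difference to a small epsilon first, e.g.
# joined_data['weight'] = 1 / joined_data['fuzzy_criteria_original'].sub(
#     joined_data['fuzzy_criteria_jumbled']).abs().clip(lower=1e-9)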

# Form graph

matching_graph = nx.from_pandas_edgelist(
    joined_data,
    source = 'sequence_number_original',
    target = 'sequence_number_jumbled',
    edge_attr = 'weight'
)

# Find matching (maxcardinality=True asks networkx for a maximum-size
# matching, which is what the question requires)

matching = nx.max_weight_matching(
    matching_graph,
    maxcardinality = True,
    weight = 'weight'
)

# max_weight_matching returns each matched pair in arbitrary orientation, so
# put the original sequence number first in every pair before building the frame

original_ids = set(original_data['sequence_number'])
matching = [(a, b) if a in original_ids else (b, a) for a, b in matching]

# Convert results back into dataframe and format

results = pd.DataFrame(
    matching,
    columns=['sequence_number_original', 'sequence_number_jumbled']
)

results = pd.merge(
    results,
    joined_data,
    how = 'inner',
    on = ['sequence_number_original', 'sequence_number_jumbled'],
)

results['fuzzy_difference'] = abs(
    results['fuzzy_criteria_original'] -
    results['fuzzy_criteria_jumbled']
)

print(results)
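
As an aside (not part of either answer above), the same maximum-size, minimum-total-difference matching can also be computed with scipy.optimize.linear_sum_assignment applied per fixed_criteria group. A rough sketch, assuming scipy is available and reusing the original_data and jumbled_data frames defined above:

import numpy as np
from scipy.optimize import linear_sum_assignment

pairs = []
for criteria, orig in original_data.groupby('fixed_criteria'):
    jumb = jumbled_data[jumbled_data['fixed_criteria'] == criteria]
    # Cost matrix of absolute fuzzy differences (rows: original, columns: jumbled)
    cost = np.abs(
        orig['fuzzy_criteria'].to_numpy()[:, None] -
        jumb['fuzzy_criteria'].to_numpy()[None, :]
    )
    # Hungarian algorithm: assigns min(len(orig), len(jumb)) pairs with minimum
    # total cost, i.e. a maximum-size matching within this group
    row_idx, col_idx = linear_sum_assignment(cost)
    for r, c in zip(row_idx, col_idx):
        pairs.append({
            'sequence_number_original': orig['sequence_number'].iloc[r],
            'sequence_number_jumbled': jumb['sequence_number'].iloc[c],
            'fuzz_diff': cost[r, c],
        })

print(pd.DataFrame(pairs))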