Suppose we have two dataframes.
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 10.42 |
2 | b | 1.27 |
3 | b | 6.32 |
4 | a | 5.91 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
11 | b | 6.43 |
12 | b | 1.26 |
13 | a | 9.98 |
14 | a | 15.84 |
15 | a | 6.01 |
I want to compute a matching between these two dataframes that gives a 1-1 correspondence: rows may only be paired when their fixed_criteria agree, the matching should contain as many pairs as possible, and among those maximum-size matchings it should minimise the differences in fuzzy_criteria. I.e. the matching would be:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 10.42 | a | 9.98 | 13 | 0.44 |
2 | 1.27 | b | 1.26 | 12 | 0.01 |
3 | 6.32 | b | 6.43 | 11 | 0.11 |
4 | 5.91 | a | 6.01 | 15 | 0.1 |
EDIT:
To highlight the need for a maximal matching, consider the following example:
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 1 |
2 | a | 2 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
13 | a | 1.9 |
14 | a | 2.9 |
Then listing every candidate pair, sorted by smallest fuzz_diff first, gives:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |
1 | 1 | a | 2.9 | 14 | 1.9 |
Dropping duplicates in sequence_number_original (keeping the smallest diff) then gives:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
and then dropping duplicates in sequence_number_jumbled:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
Deduplicating in the other order gives the same result. First sequence_number_jumbled:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
2 | 2 | a | 2.9 | 14 | 0.9 |
then sequence_number_original:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
However, this is not maximal, since the following size-2 matching exists:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |
There are maximum-matching algorithms in graph theory. I also just came across another post that is similar to mine.
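For the toy example above, an assignment solver such as scipy.optimize.linear_sum_assignment can express exactly this objective. A sketch (my own illustration, restricted to one fixed_criteria group; rectangular cost matrices need SciPy >= 1.4):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# EDIT example, fixed_criteria == 'a'
orig = np.array([1.0, 2.0])   # fuzzy_criteria of original rows 1, 2
jumb = np.array([1.9, 2.9])   # fuzzy_criteria of jumbled rows 13, 14
cost = np.abs(orig[:, None] - jumb[None, :])  # pairwise fuzz_diff matrix
rows, cols = linear_sum_assignment(cost)      # full-size matching, min total diff
print(list(zip(rows, cols)))  # [(0, 0), (1, 1)] -> pairs 1<->13 and 2<->14
```

This recovers the size-2 matching with total diff 1.8, rather than the single greedy pair.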
CodePudding user response:
If there are no duplicated values in either fuzzy_criteria column, you can create an auxiliary dataframe that pairs each value with its nearest counterpart in the other fuzzy_criteria column.
from itertools import product

import pandas as pd

# Assumes original_data and jumbled_data are defined as in the question.
pairs = sorted(
    product(original_data['fuzzy_criteria'], jumbled_data['fuzzy_criteria']),
    key=lambda t: abs(t[0] - t[1])
)
df = pd.DataFrame(pairs)
df = df.drop_duplicates(0, keep='first')  # best partner per original value
df = df.drop_duplicates(1, keep='first')  # best partner per jumbled value
print(df)
0 1
0 1.27 1.26
1 5.91 6.01
2 6.32 6.43
4 10.42 9.98
Then merge this auxiliary dataframe with each of the two dataframes separately, and finally merge those results on the auxiliary columns.
df_ = pd.merge(
    pd.merge(original_data, df, left_on='fuzzy_criteria', right_on=0),
    pd.merge(df, jumbled_data, left_on=1, right_on='fuzzy_criteria'),
    on=[0, 1],
    suffixes=('_original', '_jumbled')
).drop([0, 1], axis=1)
df_['fuzz_diff'] = (df_['fuzzy_criteria_original'] - df_['fuzzy_criteria_jumbled']).abs()
print(df_)
sequence_number_original fixed_criteria_original fuzzy_criteria_original \
0 1 a 10.42
1 2 b 1.27
2 3 b 6.32
3 4 a 5.91
sequence_number_jumbled fixed_criteria_jumbled fuzzy_criteria_jumbled \
0 13 a 9.98
1 12 b 1.26
2 11 b 6.43
3 15 a 6.01
fuzz_diff
0 0.44
1 0.01
2 0.11
3 0.10
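Note that this greedy drop_duplicates approach is not guaranteed to be maximal. Running the same steps on the two-row example from the question's EDIT keeps only one pair:

```python
from itertools import product

import pandas as pd

# EDIT example: original vs jumbled fuzzy values (all fixed_criteria 'a')
orig = [1.0, 2.0]
jumb = [1.9, 2.9]
df = pd.DataFrame(sorted(product(orig, jumb), key=lambda t: abs(t[0] - t[1])))
df = df.drop_duplicates(0, keep='first')
df = df.drop_duplicates(1, keep='first')
print(df)  # only the (2.0, 1.9) pair survives, though a size-2 matching exists
```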
CodePudding user response:
This is largely copied from @SpghttCd's answer to How to get the most pairs out of my pandas dataframe?
The idea is to use networkx to perform a maximum-weight matching.
import pandas as pd
import networkx as nx
# Data input
original_data = pd.DataFrame({
'sequence_number' : [1,2,3,4],
'fixed_criteria' : ['a','b','b','a'],
'fuzzy_criteria' : [10.42, 1.27, 6.32, 5.91]
})
jumbled_data = pd.DataFrame({
'sequence_number' : [11,12,13,14,15],
'fixed_criteria' : ['b','b','a','a','a'],
'fuzzy_criteria' : [6.43, 1.26, 9.98, 15.84, 6.01]
})
# Merge along fixed criteria
joined_data = pd.merge(
original_data,
jumbled_data,
how = 'inner',
on = ['fixed_criteria'],
suffixes=['_original','_jumbled']
)
# To use max-weight matching, take the reciprocal of the difference as the
# edge weight (if the two fuzzy values can be equal this divides by zero,
# so it would have to be changed, e.g. to 1/(eps + diff))
joined_data['weight'] = 1 / abs(
    joined_data['fuzzy_criteria_original'] -
    joined_data['fuzzy_criteria_jumbled']
)
# Form graph
matching_graph = nx.from_pandas_edgelist(
joined_data,
source = 'sequence_number_original',
target = 'sequence_number_jumbled',
edge_attr = 'weight'
)
# Find the matching; maxcardinality=True maximises the number of matched
# pairs first, then the total weight among maximum-size matchings
matching = nx.max_weight_matching(
    matching_graph,
    maxcardinality=True,
    weight='weight'
)
# max_weight_matching returns an unordered set of node pairs, so orient
# each pair as (original, jumbled) before rebuilding the dataframe
original_ids = set(original_data['sequence_number'])
oriented = [(a, b) if a in original_ids else (b, a) for a, b in matching]
# Convert results back into a dataframe and format
results = pd.DataFrame(
    oriented,
    columns=['sequence_number_original', 'sequence_number_jumbled']
)
results = pd.merge(
results,
joined_data,
how = 'inner',
on = ['sequence_number_original', 'sequence_number_jumbled'],
)
results['fuzzy_difference'] = abs(
results['fuzzy_criteria_original'] -
results['fuzzy_criteria_jumbled']
)
print(results)