When joining between python daframes, is it possible to match the included key value instead of the same key value?
For example
A | B | C | D | E |
---|---|---|---|---|
ABC | 1 | 2 | 3 | 4 |
DEF | 5 | 6 | 7 | 8 |
GHI | 9 | 10 | 11 | 12 |
df_1 like this. And
F | G | H | I | J |
---|---|---|---|---|
AB | 13 | 14 | 15 | 16 |
DE | 17 | 18 | 19 | 20 |
GH | 21 | 22 | 23 | 24 |
df_2 like this.
A | B | C | D | E | G | H | I | J |
---|---|---|---|---|---|---|---|---|
ABC | 1 | 2 | 3 | 4 | 13 | 14 | 15 | 16 |
DEF | 5 | 6 | 7 | 8 | 17 | 18 | 19 | 20 |
GHI | 9 | 10 | 11 | 12 | 21 | 22 | 23 | 24 |
Like this data frame, I want to combine columns A and F based on them. Although it is not the same KEY, the column value of F contains the column value of A, and I want to match it based on it.
Can I put conditions for key matching when I try Join? I converted each data frame in dictionary form, and I combined the data frames by matching the included KEY.
However, this data frame lost all the column names as shown in the table above, so I couldn't provide the results I wanted.
If it's not JOIN's method, is there a different way to combine the two data frames?
CodePudding user response:
One solution - although a very slow one - would be to iterate through both dataframes, comparing substrings that exist in both dataframes. Here's how the code would look like:
import pandas as pd
def string_match(
df1: pd.DataFrame, df2: pd.DataFrame, left_on: str, right_on: str
):
final = []
for i, r in df1.iterrows():
matches = [
r2
for i2, r2 in df2.iterrows()
if any(
[
*[char for char in r[left_on] if char in r2[right_on]],
*[char for char in r2[right_on] if char in r[left_on]],
]
)
]
for match in matches:
res = pd.concat([r, match], axis=0)
final.append(res)
return pd.DataFrame(final)
# == Example ==========================================
df = pd.DataFrame(
[
["ABC", 1, 2, 3, 4],
["DEF", 5, 6, 7, 8],
["GHI", 9, 10, 11, 12],
],
columns=["A", "B", "C", "D", "E"],
)
df2 = pd.DataFrame(
[
["AB", 13, 14, 15, 16],
["DE", 17, 18, 19, 20],
["GH", 21, 22, 23, 24],
["XX", 21, 22, 23, 24],
],
columns=["F", "G", "H", "I", "J"],
)
df3 = string_match(df, df2, 'A', 'F')
print(df3)
# Prints:
# A B C D E F G H I J
# 0 ABC 1 2 3 4 AB 13 14 15 16
# 1 DEF 5 6 7 8 DE 17 18 19 20
# 2 GHI 9 10 11 12 GH 21 22 23 24
A More Robust Implementation
If you want to make string_match
function more robust, you could add some parameters checks in the beginning of the function:
def string_match(
df1: pd.DataFrame, df2: pd.DataFrame, left_on: str, right_on: str
):
err1, err2 = None, None
if not isinstance(left_on, str):
err1 = ValueError(
"`left_on` needs to be a column name from `df1` dataframe, not"
f" {type(left_on)}"
)
elif not left_on in df1.columns:
err1 = KeyError(
f'`left_on` "{left_on}" is not a column from `df1`. Existing'
" columns: "
", ".join(df1.columns.astype(str).to_list())
)
if not isinstance(right_on, str):
err2 = ValueError(
"`right_on` needs to be a column name from `df2` dataframe, not"
f" {type(right_on)}"
)
elif not right_on in df2.columns:
err2 = KeyError(
f'`right_on` "{right_on}" is not a column from `df2`. Existing'
" columns: "
", ".join(df2.columns.astype(str).to_list())
)
if err1 is not None or err2 is not None:
err, _err = (err1, err2) if err1 is not None else (err2, err1)
raise err from _err
final = []
for i, r in df1.iterrows():
matches = [
r2
for i2, r2 in df2.iterrows()
if any(
[
*[char for char in r[left_on] if char in r2[right_on]],
*[char for char in r2[right_on] if char in r[left_on]],
]
)
]
for match in matches:
res = pd.concat([r, match], axis=0)
final.append(res)
return pd.DataFrame(final)
CodePudding user response:
The question is if the rows in F
are always the first two characters of their corresponding A
values in the other dataframe. Since my comment asking about this apparently wasn't clear, to clarify: With your two dataframes,
df1 = pd.DataFrame({'A': ['ABC', 'DEF', 'GHI'],
'B': ['1', '5', '9'],
'C': ['2', '6', '10'],
'D': ['3', '7', '11'],
'E': ['4', '8', '12']})
df2 = pd.DataFrame({'F': ['AB', 'DE', 'GH'],
'G': ['13', '17', '21'],
'H': ['14', '18', '22'],
'I': ['15', '19', '23'],
'J': ['16', '20', '24']})
you can get the desired result by merging on the first two characters of each row in df1.A
and df2.F
, at least that's what the pattern in your example suggests and you didn't specify otherwise.
>>> df1.merge(
df2,
left_on=df1.A.str[:2],
right_on="F"
)
.drop("F", axis=1)
A B C D E G H I J
0 ABC 1 2 3 4 13 14 15 16
1 DEF 5 6 7 8 17 18 19 20
2 GHI 9 10 11 12 21 22 23 24