Home > database >  Python: When joining between python daframes, is it possible to match the included key value instead
Python: When joining between python daframes, is it possible to match the included key value instead

Time:10-21

When joining between python daframes, is it possible to match the included key value instead of the same key value?

For example

A B C D E
ABC 1 2 3 4
DEF 5 6 7 8
GHI 9 10 11 12

df_1 like this. And

F G H I J
AB 13 14 15 16
DE 17 18 19 20
GH 21 22 23 24

df_2 like this.

A B C D E G H I J
ABC 1 2 3 4 13 14 15 16
DEF 5 6 7 8 17 18 19 20
GHI 9 10 11 12 21 22 23 24

Like this data frame, I want to combine columns A and F based on them. Although it is not the same KEY, the column value of F contains the column value of A, and I want to match it based on it.

Can I put conditions for key matching when I try Join? I converted each data frame in dictionary form, and I combined the data frames by matching the included KEY.

However, this data frame lost all the column names as shown in the table above, so I couldn't provide the results I wanted.

If it's not JOIN's method, is there a different way to combine the two data frames?

CodePudding user response:

One solution - although a very slow one - would be to iterate through both dataframes, comparing substrings that exist in both dataframes. Here's how the code would look like:

import pandas as pd


def string_match(
    df1: pd.DataFrame, df2: pd.DataFrame, left_on: str, right_on: str
):
    final = []
    for i, r in df1.iterrows():
        matches = [
            r2
            for i2, r2 in df2.iterrows()
            if any(
                [
                    *[char for char in r[left_on] if char in r2[right_on]],
                    *[char for char in r2[right_on] if char in r[left_on]],
                ]
            )
        ]
        for match in matches:
            res = pd.concat([r, match], axis=0)
            final.append(res)
    return pd.DataFrame(final)


# == Example ==========================================
df = pd.DataFrame(
    [
        ["ABC", 1, 2, 3, 4],
        ["DEF", 5, 6, 7, 8],
        ["GHI", 9, 10, 11, 12],
    ],
    columns=["A", "B", "C", "D", "E"],
)

df2 = pd.DataFrame(
    [
        ["AB", 13, 14, 15, 16],
        ["DE", 17, 18, 19, 20],
        ["GH", 21, 22, 23, 24],
        ["XX", 21, 22, 23, 24],
    ],
    columns=["F", "G", "H", "I", "J"],
)
df3 = string_match(df, df2, 'A', 'F')
print(df3)
# Prints:
#      A  B   C   D   E   F   G   H   I   J
# 0  ABC  1   2   3   4  AB  13  14  15  16
# 1  DEF  5   6   7   8  DE  17  18  19  20
# 2  GHI  9  10  11  12  GH  21  22  23  24

A More Robust Implementation

If you want to make string_match function more robust, you could add some parameters checks in the beginning of the function:

def string_match(
    df1: pd.DataFrame, df2: pd.DataFrame, left_on: str, right_on: str
):
    err1, err2 = None, None
    if not isinstance(left_on, str):
        err1 = ValueError(
            "`left_on` needs to be a column name from `df1` dataframe, not"
            f" {type(left_on)}"
        )
    elif not left_on in df1.columns:
        err1 = KeyError(
            f'`left_on` "{left_on}" is not a column from `df1`. Existing'
            " columns: "
              ", ".join(df1.columns.astype(str).to_list())
        )
    if not isinstance(right_on, str):
        err2 = ValueError(
            "`right_on` needs to be a column name from `df2` dataframe, not"
            f" {type(right_on)}"
        )
    elif not right_on in df2.columns:
        err2 = KeyError(
            f'`right_on` "{right_on}" is not a column from `df2`. Existing'
            " columns: "
              ", ".join(df2.columns.astype(str).to_list())
        )
    if err1 is not None or err2 is not None:
        err, _err = (err1, err2) if err1 is not None else (err2, err1)
        raise err from _err

    final = []
    for i, r in df1.iterrows():
        matches = [
            r2
            for i2, r2 in df2.iterrows()
            if any(
                [
                    *[char for char in r[left_on] if char in r2[right_on]],
                    *[char for char in r2[right_on] if char in r[left_on]],
                ]
            )
        ]
        for match in matches:
            res = pd.concat([r, match], axis=0)
            final.append(res)
    return pd.DataFrame(final)


CodePudding user response:

The question is if the rows in F are always the first two characters of their corresponding A values in the other dataframe. Since my comment asking about this apparently wasn't clear, to clarify: With your two dataframes,

df1 = pd.DataFrame({'A': ['ABC', 'DEF', 'GHI'], 
                    'B': ['1', '5', '9'], 
                    'C': ['2', '6', '10'], 
                    'D': ['3', '7', '11'], 
                    'E': ['4', '8', '12']})

df2 = pd.DataFrame({'F': ['AB', 'DE', 'GH'], 
                    'G': ['13', '17', '21'], 
                    'H': ['14', '18', '22'], 
                    'I': ['15', '19', '23'], 
                    'J': ['16', '20', '24']})

you can get the desired result by merging on the first two characters of each row in df1.A and df2.F, at least that's what the pattern in your example suggests and you didn't specify otherwise.

>>> df1.merge(
        df2, 
        left_on=df1.A.str[:2], 
        right_on="F"
    )
    .drop("F", axis=1)
     A  B   C   D   E   G   H   I   J
0  ABC  1   2   3   4  13  14  15  16
1  DEF  5   6   7   8  17  18  19  20
2  GHI  9  10  11  12  21  22  23  24
  • Related