I have two pandas DataFrames, df1
and df2
.
As shown in the first DataFrame df1
, the entries in the column "Box", which show the assignment of different boxes to a hand of people, can contain RegEx-values:
df1:
Person Box
0 Alex Box 1
1 Linda Box 3
2 David Box .*
3 Rachel Box [1-2]
df2:
Box Item Qty.
0 Box 1 Apple 4
1 Box 1 Blueberry 12
2 Box 2 Lemon 1
3 Box 2 Papaya 2
4 Box 3 Apple 2
How do I join the DataFrames on the common column "Box" in a way that these regEx-values are evaluated correctly (as regex are not supported in pandas' join / merge-functions), so that I receive the following DataFrame as a result:
Person Box Item Qty.
0 Alex Box 1 Apple 4
1 Alex Box 1 Blueberry 12
2 Linda Box 3 Apple 2
3 David Box 1 Apple 4
4 David Box 1 Blueberry 12
5 David Box 2 Lemon 1
6 David Box 2 Papaya 2
7 David Box 3 Apple 2
8 Rachel Box 1 Apple 4
9 Rachel Box 1 Blueberry 12
10 Rachel Box 2 Lemon 1
11 Rachel Box 2 Papaya 2
I already tried to achieve this via list comprehensions, which leads to correct results, but drops columns of the left DataFrame.
def joinWithRegEx(left: pd.DataFrame, right: pd.DataFrame, left_on: str, right_on: str):
df = pd.DataFrame
df = pd.concat([right[right[right_on].str.match(entry)] for entry in left[left_on]], ignore_index=True)
'''
Left-Join of two DataFrame with considered Rege
'''
return df
I would have actually thought that this would be a more common usecase, or is Pandas simply not the best choice for these kind if tasks?
CodePudding user response:
Using df.iterrows
it is possible to store the remaining columns of the left dataframe and reassign them after applying the REGEX filter.
See verbose explanation in the commentaries of the updated function.
def joinWithRegEx(left: pd.DataFrame, right: pd.DataFrame, left_on: str, right_on: str):
df = pd.DataFrame()
# iterate over the left dataframe
for idx, row in left.iterrows():
# store REGEX pattern
regex = row[left_on]
# store remaining columns
preserved_values = row.drop(left_on)
# filter the right dataframe
match = right[right[right_on].str.match(regex)]
# add the preserved remaining columns
match = match.assign(**preserved_values)
# append to the final dataframe
df = pd.concat([df, match], ignore_index=True)
'''
Left-Join of two DataFrame with considered Regex
'''
return df
CodePudding user response:
There are several other answers for this use case. Borrowing from: Can I perform a left join/merge between two dataframes using regular expressions with pandas?
import pandas as pd
df1 = pd.DataFrame([
['Alex','Box 1'],
['Linda','Box 3'],
['David','Box .*'],
['Rachel','Box [1-2]'],
], columns=['Person','Box'])
df2 = pd.DataFrame([
['Box 1','Apple',4],
['Box 1','Blueberry',12],
['Box 2','Lemon',1],
['Box 2','Papaya',2],
['Box 3','Apple',2]
],columns=['Box','Item','Qty.'])
import re
def merge_regex(df1, df2,left_on,right_on):
idx = [(i,j) for i,r in enumerate(left_on) for j,v in enumerate(right_on) if re.match(r,v)]
df1_idx, df2_idx = zip(*idx)
t = df1.iloc[list(df1_idx)].reset_index(drop=True)
t1 = df2.iloc[list(df2_idx)].reset_index(drop=True)
return pd.concat([t,t1],axis=1)
df = merge_regex(df1,df2,df1.Box,df2.Box)
#remove duplicate column
df = df.T.groupby(level=0).last().T
#reorder to match desired output
df = df[['Person','Box','Item','Qty.']]
df
Person Box Item Qty.
0 Alex Box 1 Apple 4
1 Alex Box 1 Blueberry 12
2 Linda Box 3 Apple 2
3 David Box 1 Apple 4
4 David Box 1 Blueberry 12
5 David Box 2 Lemon 1
6 David Box 2 Papaya 2
7 David Box 3 Apple 2
8 Rachel Box 1 Apple 4
9 Rachel Box 1 Blueberry 12
10 Rachel Box 2 Lemon 1
11 Rachel Box 2 Papaya 2