I have two dataframes and a rather tricky join to accomplish.
The first dataframe:
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1
Output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [100, 101, 102]
1 Standard2 [100, 102]
2 Standard3 [103]
... ... ...
The second one:
data = [[100, 'verahren', ['word1', 'word2']],
[101, 'flaechen', ['word3']],
[102, 'nutzung', ['word4', 'word5']],
[103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2
Output:
KeyWordGroupID KeyWordGroupName KeyWords
100 verahren ['word1', 'word2']
101 flaechen ['word3']
102 nutzung ['word4', 'word5']
103 ort ['word6', 'word7']
... ... ...
The desired output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [['word1', 'word2'], ['word3'], ['word4', 'word5']]
1 Standard2 [['word1', 'word2'], ['word4', 'word5']]
2 Standard3 [['word6', 'word7']]
I tried converting the second dataframe into a dictionary using df2.to_dict('records')
and matching on the key values inside a pandas apply user-defined function, but that doesn't seem like a clean approach.
Does someone have an approach to solve this? Any ideas are appreciated.
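For reference, the attempt described above might look roughly like this (a sketch reconstructed from the description; the original code was not posted):

```python
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
data = [[100, 'verahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])

# build {KeyWordGroupID: KeyWords} from the list of row dicts
records = df2.to_dict('records')
lookup = {row['KeyWordGroupID']: row['KeyWords'] for row in records}

# user-defined function applied per row: swap each ID for its keyword list
df1['KeyWordGroupID'] = df1['KeyWordGroupID'].apply(lambda ids: [lookup[i] for i in ids])
```

It works, but the detour through to_dict('records') plus a lambda is what the answers below clean up.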
CodePudding user response:
I think you have a couple of different options:
- You can create a dictionary and use map
- You can convert the lists to a string and use replace
Option 1
e = df1.explode('KeyWordGroupID') # explode your frame to one ID per row
# create a dictionary from KeyWords and map it to the KeyWordGroupID
e['KeyWords'] = e['KeyWordGroupID'].map(df2.set_index('KeyWordGroupID')['KeyWords'].to_dict())
# merge df1 with e
new_df = df1.merge(e.groupby('RuleSetID')['KeyWords'].agg(list), right_index=True, left_on='RuleSetID')
RuleSetID RuleSetName KeyWordGroupID \
0 0 Standard1 [100, 101, 102]
1 1 Standard2 [100, 102]
2 2 Standard3 [103]
KeyWords
0 [[word1, word2], [word3], [word4, word5]]
1 [[word1, word2], [word4, word5]]
2 [[word6, word7]]
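Option 2 was not spelled out above; here is a sketch of the string-based version, assuming the frames from the question. It renders each ID list as a string, substitutes every ID with the string form of its keyword list, and parses the result back. Note it relies on no ID being a substring of another, so the dictionary route is the safer choice:

```python
import ast

import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
data = [[100, 'verahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])

# render each ID list as a string, e.g. '[100, 101, 102]'
s = df1['KeyWordGroupID'].astype(str)

# replace every ID with the string form of its keyword list
for gid, kws in df2.set_index('KeyWordGroupID')['KeyWords'].items():
    s = s.str.replace(str(gid), str(kws), regex=False)

# parse the strings back into nested Python lists
df1['KeyWords'] = s.apply(ast.literal_eval)
```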
CodePudding user response:
@Corralien had a great answer with pandas to solve the problem. But here I want to present a cleaner way to do it with datar, a re-imagination of pandas APIs:
>>> from datar.all import f, unchop, left_join, group_by, summarise
>>>
>>> (
... df1
...     >> unchop(f.KeyWordGroupID)         # put one KeyWordGroupID per row
... >> left_join(df2, by=f.KeyWordGroupID) # Attach df2 by KeyWordGroupIDs
... >> group_by(f.RuleSetID, f.RuleSetName)
... >> summarise(KeyWords = f.KeyWords.agg(pd.Series)) # Concatenate the KeyWords
... )
[2022-03-28 13:52:38][datar][ INFO] `summarise()` has grouped output by ['RuleSetID'] (override with `_groups` argument)
RuleSetID RuleSetName KeyWords
<int64> <object> <object>
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [word6, word7]
[TibbleGrouped: RuleSetID (n=3)]
Same idea with pandas itself:
(
df1
.explode("KeyWordGroupID")
.merge(df2, how="left", on="KeyWordGroupID")
.groupby(["RuleSetID", "RuleSetName"])
.agg({"KeyWords": pd.Series})
.reset_index()
)
CodePudding user response:
The main idea is to convert df2
as a dict mapping Series
where the key
is the KeyWordGroupID
column and the value
is the KeyWords
column.
You can use explode
to flatten KeyWordGroupID
column of df1
then map
it to df2
then groupby
to reshape your first dataframe:
df1['KeyWordGroupID'] = (
df1['KeyWordGroupID'].explode().map(df2.set_index('KeyWordGroupID')['KeyWords'])
.groupby(level=0).apply(list)
)
print(df1)
# Output
RuleSetID RuleSetName KeyWordGroupID
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [[word6, word7]]