Merge two dataframes groupby the column values of a dataframe-CodePudding

I have the following dataframe:

df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home","Stay home", 'Go outside', "Go Outside","Go outside"],
                    'Child' : ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "sunny"]})

    Parent      Child
0   Stay home   Severe weather
1   Stay home   Severe weather
2   Stay home   Severe weather
3   Go outside  Sunny
4   Go Outside  Sunny
5   Go outside  sunny

And a second one:

df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})

             Similarity_Score
0   SimilarityScore:0.43693185876069784
1   SimilarityScore:0.299807821163373

I want to merge the two dataframes based on the values of Child values of df1.

Expected outcome:

     Parent     Child           Similarity_Score
0   Stay home   Severe weather  0.43693185876069784
1   Stay home   Severe weather  0.43693185876069784
2   Stay home   Severe weather  0.43693185876069784
3   Go outside  Sunny           0.299807821163373
4   Go Outside  Sunny           0.299807821163373
5   Go outside  sunny           0.299807821163373

I tried the usual merge and concat methods but could not find the solution. Any ideas?

CodePudding user response：

If you want to assign the scores based on the Child's value, you can do it like that:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home", "Stay home", 'Go outside', "Go Outside", "Go outside"],
                    'Child': ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "Sunny"]})
df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})

# Split the string at : and convert to float
df2['Similarity_Score'] = df2['Similarity_Score'].str.split(':').str[1].astype(float)

# calculate auxiliary column position to base the matching on
df1['position'] = df1['Child'].apply(lambda row: np.where(df1['Child'].unique() == row)[0][0])

# merge both dataframes and drop auxiliary column position
df = df1.merge(df2, left_on='position', right_index=True).drop(columns=["position"])

CodePudding user response：

based on your responses, the merge is based on the index, after taking identifying unique values in df1


# identifying the group
df1['key']=df1.groupby(['Parent','Child']).ngroup(ascending=False)
df1

# merge the two DF, and while merging split the similarity scope to take only numeric part

(df1.merge(df2['Similarity_Score'].str.split(':', expand=True)[1],
          left_on='key', 
          right_index=True)
    .drop(columns='key'))

Parent  Child   1
0   Stay home   Severe weather  0.43693185876069784
1   Stay home   Severe weather  0.43693185876069784
2   Stay home   Severe weather  0.43693185876069784
3   Go outside  Sunny   0.299807821163373
4   Go outside  Sunny   0.299807821163373
5   Go outside  Sunny   0.299807821163373

CodePudding user response：

Is there a specific parameter you want to join based of? Are the scores constant for for Sunny (~0.299) and Severe Weather? If that's the case then make a binary column and merge based on that.