Get multiple column value based on partial matching with another column value for pandas dataframe-CodePudding

I have a following dataframe:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]',  '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]','[email protected]']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

However, I want to keep only those values for property named "emails" whose every element's value after '@' is same as the corresponding value of the 'main_url' property but after "http://" resulting the following data frame:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

Any hints or approach is appreciable considering the fact that I have millions row to implement the transformation

CodePudding user response：

Give this a try I think it should be able to handle a few millions of rows.

def list_check(emails_list, email_match):
    match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
    return [emails_list[index] for index in match_indexes]

# Parse main_url to get domain column
df['domain'] = list(map(lambda x: x.split('//')[1], df['main_url']))

# Apply list_check to your dataframe using emails and domain columns
df['emails'] = list(map(lambda x, y: list_check(x, y), df['emails'], df['domain']))

# Drop domain column
df.drop(columns=['domain'], inplace=True)

list_check function checks whether your match string is in the emails list and gets indexes of matches, then gets values from the emails list using matched indexes and returns those values in a list.

Output:

source for getting matched indexes

CodePudding user response：

You can try to explode the mails column then compare with main_url column

df_ = URL_WITH_EMAILS_DF.explode('emails').reset_index()
m = (df_['main_url'].str.split('//').str[1] == df_['emails'].str.split('@').str[1])
df_ = df_[m].groupby('main_url').agg({'emails':lambda x: list(x)})

print(df_)
                                                  emails
main_url
http://keilstruplund.dk      [[email protected]]
http://kirsebaergaarden.com  [[email protected]]
http://koglernes.dk                  [[email protected]]

At last, map the result to original dataframe

URL_WITH_EMAILS_DF['matched'] = URL_WITH_EMAILS_DF['main_url'].map(df_['emails']).fillna("").apply(list)

print(URL_WITH_EMAILS_DF[['main_url', 'matched']])

                         main_url                      matched
0         http://keilstruplund.dk  [[email protected]]
1     http://kirsebaergaarden.com  [[email protected]]
2             http://koglernes.dk          [[email protected]]
3  http://kongehojensbornehave.dk                           []