I have a following dataframe:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},
{'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]','[email protected]']},
{'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
However, I want to keep only those values for property named "emails" whose every element's value after '@' is same as the corresponding value of the 'main_url' property but after "http://" resulting the following data frame:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]']},
{'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
{'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
Any hints or approach is appreciable considering the fact that I have millions row to implement the transformation
CodePudding user response:
Give this a try I think it should be able to handle a few millions of rows.
def list_check(emails_list, email_match):
match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
return [emails_list[index] for index in match_indexes]
# Parse main_url to get domain column
df['domain'] = list(map(lambda x: x.split('//')[1], df['main_url']))
# Apply list_check to your dataframe using emails and domain columns
df['emails'] = list(map(lambda x, y: list_check(x, y), df['emails'], df['domain']))
# Drop domain column
df.drop(columns=['domain'], inplace=True)
list_check
function checks whether your match string is in the emails list and gets indexes of matches, then gets values from the emails list using matched indexes and returns those values in a list.
Output:
source for getting matched indexes
CodePudding user response:
You can try to explode
the mails
column then compare with main_url
column
df_ = URL_WITH_EMAILS_DF.explode('emails').reset_index()
m = (df_['main_url'].str.split('//').str[1] == df_['emails'].str.split('@').str[1])
df_ = df_[m].groupby('main_url').agg({'emails':lambda x: list(x)})
print(df_)
emails
main_url
http://keilstruplund.dk [[email protected]]
http://kirsebaergaarden.com [[email protected]]
http://koglernes.dk [[email protected]]
At last, map the result to original dataframe
URL_WITH_EMAILS_DF['matched'] = URL_WITH_EMAILS_DF['main_url'].map(df_['emails']).fillna("").apply(list)
print(URL_WITH_EMAILS_DF[['main_url', 'matched']])
main_url matched
0 http://keilstruplund.dk [[email protected]]
1 http://kirsebaergaarden.com [[email protected]]
2 http://koglernes.dk [[email protected]]
3 http://kongehojensbornehave.dk []