When I scrape websites for all the emails on each website and try to output them, I get a data frame whose emails column holds, for each website, a list of sets with multiple elements:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, set(),{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]},
{'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'},set(), set(), {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
However, I want to process the data frame to look like the following:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},
{'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
{'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
How can this be achieved?
I have tried the following code, but it only manages to return the first element of the first set, and it runs into an error when the email list for a given website is empty:
URL_WITH_EMAILS_DF['emails'] = [', '.join(x.pop()) if not None else "" for x in URL_WITH_EMAILS_DF['emails'].values]
Please help. Thanks
P.S.: As shown in the first data frame, I needed to collect a set of emails per page because a single website can have multiple web pages, and I do not want to keep duplicate emails from each web page.
CodePudding user response:
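chain.from_iterable from itertools can solve this problem: it flattens the list of sets into a single stream of emails, and wrapping the result in set() removes the duplicates.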
import pandas as pd
from itertools import chain
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]},
{'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
URL_WITH_EMAILS_DF['emails'] = URL_WITH_EMAILS_DF.emails.apply(lambda x: list(set(chain.from_iterable(x))))
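Note that list(set(...)) returns the emails in no particular order. If a stable order matters, the same idea can be sketched with sorted() instead. The following is only a minimal illustration: the data frame contents, the example.* addresses and the helper name flatten_unique are made up here and are not from the question.

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data: for each website, a list of per-page email sets.
df = pd.DataFrame(data=[
    {'main_url': 'http://a.example',
     'emails': [{'info@a.example', 'sales@a.example'}, set(), {'info@a.example'}]},
    {'main_url': 'http://b.example', 'emails': [set(), set()]},
    {'main_url': 'http://c.example', 'emails': []},
])


def flatten_unique(sets_per_page):
    """Merge the per-page sets into one sorted list without duplicates."""
    # chain.from_iterable yields every email from every set; an empty list
    # or empty sets simply yield nothing, so no special case is needed.
    return sorted(set(chain.from_iterable(sets_per_page)))


df['emails'] = df['emails'].apply(flatten_unique)
print(df['emails'].tolist())
# [['info@a.example', 'sales@a.example'], [], []]
```

Websites with no emails (an empty list, or only empty sets) end up with an empty list rather than raising an error.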