Unable to process a list of sets with multiple elements when the list of sets can be empty

Time:04-15

When I scrape websites for all the emails on each website and try to output them, I get the data frame below, in which the 'emails' column holds, for each website, a list of sets with multiple elements:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, set(),{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]}, 
                                    {'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
                                     {'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'},set(), set(), {'[email protected]'}, {'[email protected]'}]},
                                      {'main_url': 'http://kongehojensbornehave.dk', 'emails': [set()]}
                                   ])


However, I want to process the data frame to look like the following:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]',  '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},                                        
                                     {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
                                     {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                      {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                                   ])


How can this be achieved?

I have tried the following code, but it only manages to return the first element of the first set, and it runs into an error when there is no element in the email list for a given website:

URL_WITH_EMAILS_DF['emails'] = [', '.join(x.pop()) if not None else "" for x in URL_WITH_EMAILS_DF['emails'].values]

P.S:

  1. As the first data frame shows, I needed to collect a set of emails per page because a single website can have multiple webpages, and I do not want to keep duplicate emails from each web page.
  2. If a list is [set(), set()] or [], it should be considered empty. Also, when set() appears as a value in 'emails', the code just throws "TypeError: 'NoneType' object is not iterable".
  3. Thanks to Chris, who provided a solution here. However, it shows the error mentioned in point 2. The solution is as follows:
from itertools import chain
URL_WITH_EMAILS_DF['emails'] = URL_WITH_EMAILS_DF.emails.apply(lambda x: list(set(chain.from_iterable(x))))

Note: If any of you benefit from this question and its answers, please upvote my question. Thanks in advance.

CodePudding user response:

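Use a simple list comprehension with a set union; this should be the fastest:

# set().union(*s) also returns an empty set for an empty list,
# whereas the unbound set.union(*s) raises TypeError in that case
URL_WITH_EMAILS_DF['emails'] = [list(set().union(*s))
                                for s in URL_WITH_EMAILS_DF['emails']]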

output:

                         main_url                                                                                                                                                                                                                                                                                                                                                   emails
0         http://keilstruplund.dk  [[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]]
1     http://kirsebaergaarden.com                                                                                                                                                                                                                                                                                                          [[email protected], [email protected]]
2             http://koglernes.dk                                                                                                                                                                                                                                                                                                                                      [[email protected]]
3  http://kongehojensbornehave.dk                                                                                                                                                                                                                                                                                                                                                       []
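The edge cases from the question ([set(), set()], [] and a missing value) are worth checking separately, because the unbound call set.union(*s) raises a TypeError when s is an empty list. The following is only a minimal sketch on plain Python lists: the helper name flatten_emails and the example.com addresses are made up for illustration, and sorting is added just to make the result deterministic.

```python
def flatten_emails(list_of_sets):
    """Merge a list of sets into one sorted list; treat None and [] as empty."""
    if not list_of_sets:  # covers both None and an empty list
        return []
    # set().union(...) starts from an empty set, so [set(), set()] yields [].
    return sorted(set().union(*list_of_sets))


# Hypothetical rows mirroring the shapes described in the question.
rows = [
    [{"a@example.com", "b@example.com"}, set(), {"a@example.com"}],
    [set(), set()],
    [],
    None,
]

result = [flatten_emails(r) for r in rows]
print(result)
```

With a data frame, the same helper could be passed to .map() on the 'emails' column, assuming the column holds lists of sets or missing values.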

From a string that represents a list of sets:

from ast import literal_eval

URL_WITH_EMAILS_DF['emails'] = [list(set.union(*literal_eval(s)))
                                for s in URL_WITH_EMAILS_DF['emails']]

or, if you have improperly formed strings, and assuming you don't have single quotes in your email addresses, you can use a regex:

import re
URL_WITH_EMAILS_DF['emails'] = [list(set(re.findall(r"'([^']+@[^']+)'", s)))
                                for s in URL_WITH_EMAILS_DF['emails']]
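To see how the two string-based approaches differ, here is a small sketch on two made-up cell values, one well formed and one truncated. The function names and the example.com addresses are assumptions for illustration only. Note that literal_eval accepts the empty-set spelling set() only from Python 3.9 onwards, so the sample strings avoid it.

```python
import re
from ast import literal_eval

# Hypothetical cell contents; the addresses are placeholders.
well_formed = "[{'info@example.com', 'sales@example.com'}, {'info@example.com'}]"
malformed = "[{'info@example.com', 'sales@example.com'}, {'info@example.com'"  # truncated

# Captures any single-quoted token that contains an '@'.
EMAIL_IN_QUOTES = re.compile(r"'([^']+@[^']+)'")


def emails_from_literal(text):
    """Parse a valid Python literal (a list of sets) and merge the sets."""
    return sorted(set().union(*literal_eval(text)))


def emails_from_regex(text):
    """Pull quoted addresses straight out of the text, valid literal or not."""
    return sorted(set(EMAIL_IN_QUOTES.findall(text)))


parsed = emails_from_literal(well_formed)
scraped = emails_from_regex(malformed)
print(parsed)
print(scraped)
```

The regex route is more forgiving but also less strict: it will pick up any quoted token containing an @, so it is best kept as a fallback for strings that literal_eval cannot parse.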

CodePudding user response:

You can write a function to combine the sets through unions and then perform your filtering when casting it into a list:

def combine_sets(list_of_sets):
    output = set()    
    for one_set in list_of_sets:
        output = output.union(one_set)

    # Do whatever filtering you need here
    return list(email if email is not None else '' for email in output)

Then your code will just be:

URL_WITH_EMAILS_DF['emails'] = URL_WITH_EMAILS_DF['emails'].map(combine_sets)

Also, just a side note regarding your:

URL_WITH_EMAILS_DF['emails'] = [', '.join(x.pop()) if not None else "" for x in URL_WITH_EMAILS_DF['emails'].values]

you are checking if not None, which always evaluates to True because it never looks at x; you should write if x is not None instead.
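A short sketch of that point, using made-up rows rather than the real data frame: not None is a constant, and x.pop() on a list removes and returns only its last set, so the remaining sets are never joined. The variant below tests x itself and merges every set before joining. The sorting is an addition to keep the output stable.

```python
# `not None` is a constant expression: it never inspects x.
assert (not None) is True

# Hypothetical rows: a list with overlapping sets, a list holding one empty set,
# and an empty list.
rows = [
    [{"a@example.com"}, {"b@example.com", "a@example.com"}],
    [set()],
    [],
]

# Test x itself, then merge all sets in the list instead of popping one.
joined = [", ".join(sorted(set().union(*x))) if x else "" for x in rows]
print(joined)
```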
