I have a column in a dataframe consisting of lists of URLs.
index url_all
0 ['https://google.com/7TU4za', 'http://twitter.com/d']
1 ['https://google.com/7TU4bb', 'facebook.com']
2 ['https://google.com/7TU4bc', 'https://twitter.com/a']
3 ['http://google.com/7TU4ad', 'https://twitter.com/b']
4 ['https://google.com/7TU4ze', 'twitter.com/c']
I want to remove elements from each list if they start with 'http' or 'https'. The desired output is below.
index url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
So far I have tried the following, but it did not work.
df['url_all'] = df['url_all'].apply(lambda lst: [x for x in lst if not x.startswith("'http|'https")])
It gives the output below (for brevity, only the first few rows are shown):
url_all
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
How can I do that please?
CodePudding user response:
You can use ast.literal_eval() together with .apply(). Your column appears to contain strings that only look like lists (which is why your comprehension iterated over individual characters), so literal_eval() first converts each string into a real list, and the comprehension then filters it. Note that anything that starts with "https" will also start with "http" (per a suggestion from Nick), so a single check is enough:
import ast

df['url_all'] = (df['url_all']
                 .apply(ast.literal_eval)
                 .apply(lambda lst: [x for x in lst if not x.startswith("http")]))
This outputs:
index url_all
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
CodePudding user response:
I think this fixes your problem.
First, I made the same DataFrame you have:
import pandas as pd

data = [['https://google.com/7TU4za', 'http://twitter.com/d'],
        ['https://google.com/7TU4bb', 'facebook.com'],
        ['https://google.com/7TU4bc', 'https://twitter.com/a'],
        ['http://google.com/7TU4ad', 'https://twitter.com/b'],
        ['https://google.com/7TU4ze', 'twitter.com/c']]
dictionary = {'url_all': data}
df = pd.DataFrame(dictionary)
This gives the following, matching your data:
url_all
0 ['https://google.com/7TU4za', 'http://twitter.com/d']
1 ['https://google.com/7TU4bb', 'facebook.com']
2 ['https://google.com/7TU4bc', 'https://twitter.com/a']
3 ['http://google.com/7TU4ad', 'https://twitter.com/b']
4 ['https://google.com/7TU4ze', 'twitter.com/c']
Then, the following line gave me the result you wanted. Note that anything that starts with 'https' also starts with 'http', so we don't have to check both conditions:
df['url_all'] = df['url_all'].apply(lambda lst: [x for x in lst if not x.startswith('http')])
Now, the DataFrame looks like this, as desired:
url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
CodePudding user response:
As you have strings, this can be solved efficiently with a regex:
df['url_all'] = df['url_all'].str.replace(r"'http[^']*'(?:,\s*)?", "", regex=True)
A variant that also handles the dangling comma left behind when the removed URL is the last element of a row:
regex = r"(?:,\s*)?'http[^']*'(?:,\s*)?(?=\]$)|'http[^']*'(?:,\s*)?"
df['url_all'] = df['url_all'].str.replace(regex, "", regex=True)
Output:
url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
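To see why the variant exists: if an http URL were the last element of a row (a hypothetical case, not present in the sample data), the first pattern would leave a dangling comma behind. A small sketch with the standard re module:
import re

s = "['facebook.com', 'http://twitter.com/d']"  # hypothetical row with the http URL last
print(re.sub(r"'http[^']*'(?:,\s*)?", "", s))
# "['facebook.com', ]" - dangling comma left behind
regex = r"(?:,\s*)?'http[^']*'(?:,\s*)?(?=\]$)|'http[^']*'(?:,\s*)?"
print(re.sub(regex, "", s))
# "['facebook.com']" - the preceding comma is removed as well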
CodePudding user response:
You can try this:
df['url_all'].apply(lambda lst: list(filter(lambda url: not url.startswith('http'), lst)))
Output:
>>> df['url_all'].apply(lambda lst: list(filter(lambda url: not url.startswith('http'), lst)))
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
Name: url_all, dtype: object
CodePudding user response:
You can try using this code:
class CleanLink:
    """class that cleans stuff"""

    def __init__(self, list):
        # create a list attribute
        self.list = list

    def clean(self):
        clean_links = []  # empty clean-link list
        for x in self.list:  # loop through the list attribute
            if 'https://' in x:  # looking for https
                without_https = x.replace('https://', '')  # using the replace method
                clean_links.append(without_https)  # append the cleaned link
            elif 'http://' in x:  # looking for http
                without_http = x.replace('http://', '')  # using the replace method
                clean_links.append(without_http)  # append the cleaned link
            else:
                clean_links.append(x)  # append the link unchanged
        return clean_links  # return the cleaned links
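A minimal usage sketch, assuming df['url_all'] already holds real Python lists (e.g. after ast.literal_eval as in the first answer). Note that this version strips the 'http://'/'https://' prefix but keeps every element, so its result differs from the desired output in the question:
df['url_all'] = df['url_all'].apply(lambda lst: CleanLink(lst).clean())
# e.g. CleanLink(['https://google.com/7TU4bb', 'facebook.com']).clean()
# returns ['google.com/7TU4bb', 'facebook.com']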
CodePudding user response:
Or this one, which skips the http/https links entirely instead of stripping the prefix:
class CleanLink:
    """class that cleans stuff"""

    def __init__(self, list):
        # create a list attribute
        self.list = list

    def clean(self):
        clean_links = []  # empty clean-link list
        for x in self.list:  # loop through the list attribute
            if 'https://' in x:  # looking for https
                continue
            elif 'http://' in x:  # looking for http
                continue
            else:
                clean_links.append(x)  # append the clean link
        return clean_links  # return the clean links
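A minimal usage sketch, again assuming the column holds real Python lists; this version does drop the http/https entries and therefore matches the desired output:
df['url_all'] = df['url_all'].apply(lambda lst: CleanLink(lst).clean())
# e.g. CleanLink(['https://google.com/7TU4bb', 'facebook.com']).clean()
# returns ['facebook.com']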
CodePudding user response:
Another way to filter your column and convert the values to real lists:
df['url_all'] = (df['url_all'].str[1:-1].str.split(', ').explode().str.strip("'")
                              .loc[lambda x: ~x.str.startswith('http')]
                              .groupby(level=0).agg(list)
                              .reindex(df.index, fill_value=[]))
print(df)
# Output
url_all
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
- Remove the brackets and split on the comma:
.str[1:-1].str.split(', ')
- Explode your lists and strip the quotes:
.explode().str.strip("'")
- Keep only the expected values:
.loc[lambda x: ...]
- Reshape your dataframe and rebuild the lists:
.groupby(level=0).agg(list)
- Reindex the output and fill missing rows with an empty list:
.reindex(...)