I have a column in a dataframe consisting of lists of URLs.
index url_all
0 ['https://google.com/7TU4za', 'http://twitter.com/d']
1 ['https://google.com/7TU4bb', 'facebook.com']
2 ['https://google.com/7TU4bc', 'https://twitter.com/a']
3 ['http://google.com/7TU4ad', 'https://twitter.com/b']
4 ['https://google.com/7TU4ze', 'twitter.com/c']
I want to remove elements from each list if they start with 'http' or 'https'. The desired output is below.
index url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
So far I have tried the following, but it did not work.
df['url_all'] = df['url_all'].apply(lambda lst: [x for x in lst if not x.startswith("'http|'https")])
It gives the output below (for brevity, only the first few rows are shown):
url_all
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
How can I do that please?
CodePudding user response:
You can use ast.literal_eval() together with .apply(). Your column appears to contain strings that only look like lists (which is why your comprehension iterated over individual characters), so literal_eval() first converts each string into a real list, and the comprehension then filters it. Note that anything that starts with "https" will also start with "http" (per a suggestion from Nick), so a single check is enough:
import ast

df['url_all'] = (df['url_all']
                 .apply(ast.literal_eval)
                 .apply(lambda lst: [x for x in lst if not x.startswith("http")]))
This outputs:
index url_all
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
CodePudding user response:
I think this fixes your problem.
First, I made the same DataFrame you have:
import pandas as pd

data = [['https://google.com/7TU4za', 'http://twitter.com/d'],
        ['https://google.com/7TU4bb', 'facebook.com'],
        ['https://google.com/7TU4bc', 'https://twitter.com/a'],
        ['http://google.com/7TU4ad', 'https://twitter.com/b'],
        ['https://google.com/7TU4ze', 'twitter.com/c']]
dictionary = {'url_all': data}
df = pd.DataFrame(dictionary)
This gives the following, matching your data:
url_all
0 ['https://google.com/7TU4za', 'http://twitter.com/d']
1 ['https://google.com/7TU4bb', 'facebook.com']
2 ['https://google.com/7TU4bc', 'https://twitter.com/a']
3 ['http://google.com/7TU4ad', 'https://twitter.com/b']
4 ['https://google.com/7TU4ze', 'twitter.com/c']
Then, the following line gave me the result you wanted. Note that anything that starts with 'https' also starts with 'http', so we don't have to check both conditions:
df['url_all'] = df['url_all'].apply(lambda lst: [x for x in lst if not x.startswith('http')])
Now, the DataFrame looks like this, as desired:
url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
CodePudding user response:
As you have strings, this can be solved efficiently with a regex:
df['url_all'] = df['url_all'].str.replace(r"'http[^']*'(?:,\s*)?", "", regex=True)
A variant that also handles the dangling comma left behind when the removed URL is the last element of a row:
regex = r"(?:,\s*)?'http[^']*'(?:,\s*)?(?=\]$)|'http[^']*'(?:,\s*)?"
df['url_all'] = df['url_all'].str.replace(regex, "", regex=True)
Output:
url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
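To see why the variant exists: if an http URL were the last element of a row (a hypothetical case, not present in the sample data), the first pattern would leave a dangling comma behind. A small sketch with the standard re module:
import re

s = "['facebook.com', 'http://twitter.com/d']"  # hypothetical row with the http URL last
print(re.sub(r"'http[^']*'(?:,\s*)?", "", s))
# "['facebook.com', ]" - dangling comma left behind
regex = r"(?:,\s*)?'http[^']*'(?:,\s*)?(?=\]$)|'http[^']*'(?:,\s*)?"
print(re.sub(regex, "", s))
# "['facebook.com']" - the preceding comma is removed as well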
CodePudding user response:
You can try this:
df['url_all'].apply(lambda lst: list(filter(lambda url: not url.startswith('http'), lst)))
Output:
>>> df['url_all'].apply(lambda lst: list(filter(lambda url: not url.startswith('http'), lst)))
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
Name: url_all, dtype: object
CodePudding user response:
You can try using this code:
class CleanLink:
    """class that cleans stuff"""

    def __init__(self, list):
        # create a list attribute
        self.list = list

    def clean(self):
        clean_links = []  # empty clean-link list
        for x in self.list:  # loop through the list attribute
            if 'https://' in x:  # looking for https
                without_https = x.replace('https://', '')  # using the replace method
                clean_links.append(without_https)  # append the cleaned link
            elif 'http://' in x:  # looking for http
                without_http = x.replace('http://', '')  # using the replace method
                clean_links.append(without_http)  # append the cleaned link
            else:
                clean_links.append(x)  # append the link unchanged
        return clean_links  # return the cleaned links
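A minimal usage sketch, assuming df['url_all'] already holds real Python lists (e.g. after ast.literal_eval as in the first answer). Note that this version strips the 'http://'/'https://' prefix but keeps every element, so its result differs from the desired output in the question:
df['url_all'] = df['url_all'].apply(lambda lst: CleanLink(lst).clean())
# e.g. CleanLink(['https://google.com/7TU4bb', 'facebook.com']).clean()
# returns ['google.com/7TU4bb', 'facebook.com']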
CodePudding user response:
Or this one, which skips the http/https links entirely instead of stripping the prefix:
class CleanLink:
    """class that cleans stuff"""

    def __init__(self, list):
        # create a list attribute
        self.list = list

    def clean(self):
        clean_links = []  # empty clean-link list
        for x in self.list:  # loop through the list attribute
            if 'https://' in x:  # looking for https
                continue
            elif 'http://' in x:  # looking for http
                continue
            else:
                clean_links.append(x)  # append the clean link
        return clean_links  # return the clean links
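A minimal usage sketch, again assuming the column holds real Python lists; this version does drop the http/https entries and therefore matches the desired output:
df['url_all'] = df['url_all'].apply(lambda lst: CleanLink(lst).clean())
# e.g. CleanLink(['https://google.com/7TU4bb', 'facebook.com']).clean()
# returns ['facebook.com']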
CodePudding user response:
Another way to filter your column and convert the values to real lists:
df['url_all'] = (df['url_all'].str[1:-1].str.split(', ').explode().str.strip("'")
                              .loc[lambda x: ~x.str.startswith('http')]
                              .groupby(level=0).agg(list)
                              .reindex(df.index, fill_value=[]))
print(df)
# Output
url_all
0 []
1 [facebook.com]
2 []
3 []
4 [twitter.com/c]
- Remove the brackets and split on the comma:
.str[1:-1].str.split(', ')
- Explode your lists and strip the quotes:
.explode().str.strip("'")
- Keep only the expected values:
.loc[lambda x: ...]
- Reshape your dataframe and rebuild the lists:
.groupby(level=0).agg(list)
- Reindex the output and fill missing rows with an empty list:
.reindex(...)