I wrote functions to manipulate the URL strings in my dataframe and create new columns from the functions' outputs.
I defined my class as:
import math
import numpy as np
from requests import get  # assumed: get() looks like requests.get
from pyquery import PyQuery

class URL(object):
    def __init__(self, url):
        self.url = url
        self.domain = url.split('//')[-1].split('/')[0]
        self.response = get(self.url)
        self.pq = PyQuery(self.response.text)

    def entropy(self):
        string = self.url.strip()
        prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]
        # Shannon entropy is the negated sum of p * log2(p)
        entropy = -sum([(p * math.log(p) / math.log(2.0)) for p in prob])
        return entropy

    def bodyLength(self):
        if self.pq is not None:
            return len(self.pq('html').text())
        else:
            return 0

    def run(self, df):
        df['entropy'] = np.vectorize(self.entropy)(df['url_without_parameters'])
        return df
But my brain has stopped and I couldn't figure out how to call my class and create the new columns.
CodePudding user response:
If I understood correctly: first create a column of URL instances from the 'url_without_parameters' column, then create a second column by calling the entropy method on each instance. Both steps can be done with the apply method:
urls = df['url_without_parameters'].apply(URL)
df['entropy'] = urls.apply(lambda url: url.entropy())
Or in a single line:
df['entropy'] = df['url_without_parameters'].apply(lambda url_string: URL(url_string).entropy())
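One thing to keep in mind: URL.__init__ in the question calls get(self.url), so building one URL instance per row sends one HTTP request per row, even if you only want the entropy. Here is a minimal sketch, assuming the string-only features can live outside the class. The helper names url_entropy and url_domain are hypothetical (they are not in the original code), and the sample URLs are made up for illustration.

```python
import math
from collections import Counter


def url_entropy(url: str) -> float:
    """Shannon entropy, in bits per character, of a URL string."""
    text = url.strip()
    if not text:
        # Avoid dividing by zero on empty or whitespace-only input.
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_domain(url: str) -> str:
    """Host part of a URL, using the same split logic as URL.__init__."""
    return url.split('//')[-1].split('/')[0]


# Hypothetical sample values standing in for df['url_without_parameters'].
sample_urls = ['http://example.com/aabb', 'https://example.org/']
features = [
    {'url': u, 'domain': url_domain(u), 'entropy': url_entropy(u)}
    for u in sample_urls
]
```

Because these are plain functions of a string, they can be passed to apply directly, for example df['entropy'] = df['url_without_parameters'].apply(url_entropy), and no request is made. The URL class is then only needed for features such as bodyLength that really depend on the page content.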