I have a dataframe in which, for the column 'pages', I need to count the number of unique elements that appear before the first element containing the sub-string 'log_in'. If a list contains more than one such element, I need to count only up to the first one.
Input example:
site | pages |
---|---|
zoom.us | ['zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344'] |
zoom.us | ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye'] |
Output example:
site | pages | unique_pages_before_log_in |
---|---|---|
zoom.us | ['zoom.us/register', 'zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344'] | 1 |
zoom.us | ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye'] | 3 |
I thought about using a set to count the unique values, but I don't know how to count only until the first 'log_in' sub-string appears. Something like this:
df['unique_pages_before_login'] = df['pages'].apply(lambda l: len(set(l[:l.index('zoom.us/log_in')])))
I will appreciate any help :)
CodePudding user response:
It looks like you have to use .apply() here. One approach is to add each element to a set until you reach one that contains your search string. When you find it, return the size of the set you have built.
def count_unique_before_login(pages):
    c = set()
    for item in pages:
        if "log_in" in item:
            return len(c)
        c.add(item)
    return None  # No log_in found
import pandas as pd

# Build a real DataFrame; a plain dict has no .apply()
df = pd.DataFrame({'site': {0: 'zoom.us', 1: 'zoom.us'},
                   'pages': {0: ['zoom.us/register',
                                 'zoom.us/log_in/=?sdsd',
                                 'zoom.us/log_in/=a3344'],
                             1: ['zoom.us/about_us',
                                 'zoom.us/error',
                                 'zoom.us/help',
                                 'zoom.us/log_in/jjjsl',
                                 'zoom.us/log_in/llaye']}})
df["unique_pages_before_log_in"] = df["pages"].apply(count_unique_before_login)
Which gives:
      site  ...  unique_pages_before_log_in
0  zoom.us  ...                           1
1  zoom.us  ...                           3
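The sample data above contains no repeated pages, so it does not exercise the "unique" part of the requirement. The sketch below is an equivalent formulation using itertools.takewhile, run on plain lists that include the duplicated 'zoom.us/register' row from the question's output example. The function name count_unique_before, its marker parameter, and the third row without a log_in page are assumptions made for illustration.

```python
from itertools import takewhile

def count_unique_before(pages, marker="log_in"):
    """Count distinct pages that occur before the first page containing marker."""
    if not any(marker in page for page in pages):
        return None  # no page in this list contains the marker
    # takewhile stops at the first page that contains the marker
    before = takewhile(lambda page: marker not in page, pages)
    return len(set(before))

rows = [
    ["zoom.us/register", "zoom.us/register",
     "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"],
    ["zoom.us/about_us", "zoom.us/error", "zoom.us/help",
     "zoom.us/log_in/jjjsl", "zoom.us/log_in/llaye"],
    ["zoom.us/about_us"],  # illustrative row with no log_in page
]
counts = [count_unique_before(row) for row in rows]
print(counts)  # [1, 3, None]
```

The same function can be passed to df["pages"].apply(...) in place of the loop version.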
CodePudding user response:
First, let's write a function that finds the first log_in page. It counts the unique pages (preserving order) that appear before the first log_in entry.
def find_log_in(pages):
    # Duplicate removal while preserving order, original idea from: https://stackoverflow.com/a/17016257/3281097
    # Requires Python 3.7+, where dicts are guaranteed to keep insertion order
    for i, page in enumerate(dict.fromkeys(pages)):
        if page.startswith("zoom.us/log_in/"):
            # i equals the number of distinct pages seen before this one
            return i
    return None  # -1 or any value that you prefer
Now, you just need to apply this function to your column:
df["unique_pages_before_log_in"] = df["pages"].apply(find_log_in)
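To see why the position in the de-duplicated sequence equals the number of unique pages before the first log_in page, here is a small sketch on a single list taken from the question's output example. The variable names are illustrative, and the check uses a plain substring test rather than startswith, which is an assumption.

```python
pages = ["zoom.us/register", "zoom.us/register",
         "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"]

# dict.fromkeys keeps the first occurrence of each page, in order
unique_in_order = list(dict.fromkeys(pages))

# Position of the first log_in page among the unique pages
first_log_in = next(i for i, page in enumerate(unique_in_order)
                    if "log_in" in page)
print(unique_in_order)
print(first_log_in)  # 1
```

Every page ahead of that position is distinct and is not a log_in page, so the index doubles as the count.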
CodePudding user response:
Try:
df["unique_pages_before_log_in"] = df["pages"].apply(lambda x: len(set(x[:min(i for i, s in enumerate(x) if "log_in" in s)])))
>>> df
      site  ...  unique_pages_before_log_in
0  zoom.us  ...                           1
1  zoom.us  ...                           3
[2 rows x 3 columns]
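Note that the one-liner above raises a ValueError for any row that has no log_in page, because min() receives an empty sequence. A possible variant, sketched here on plain lists, uses next() with a default. The name unique_before is hypothetical, and counting every unique page when no log_in page exists is an assumption; return None instead if that suits your data better.

```python
def unique_before(pages, marker="log_in"):
    # Index of the first page containing the marker; falls back to
    # len(pages) (an assumed choice) when no page matches
    cut = next((i for i, s in enumerate(pages) if marker in s), len(pages))
    return len(set(pages[:cut]))

with_log_in = unique_before(
    ["zoom.us/register", "zoom.us/register", "zoom.us/log_in/=?sdsd"])
without_log_in = unique_before(["zoom.us/about_us", "zoom.us/help"])
print(with_log_in, without_log_in)  # 1 2
```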
CodePudding user response:
You can try to use re.findall and a for loop to get what you want.
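import re

def find_unique_elements(list_, matchword):
    unique_no = []
    for row in list_:
        count = None  # stays None when no page in the row matches
        for i in range(len(row)):
            if re.findall(re.escape(matchword), str(row[i])):
                # Count the distinct pages before the first match, not the index
                count = len(set(row[:i]))
                break
        unique_no.append(count)  # one entry per row keeps lengths aligned
    return unique_no

matchword = "log_in"
list_ = df["pages"]
ddf = find_unique_elements(list_, matchword)
df["unique_pages_before_log_in"] = ddf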
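A regular expression mainly pays off when the log-in marker can be spelled in more than one way. The sketch below assumes a hypothetical pattern that accepts "log_in", "log in", "log-in", and "login" in any letter case; neither the pattern nor the function name comes from the question, so adjust both to your URLs.

```python
import re

# Hypothetical pattern: optional space, underscore, or hyphen between "log" and "in"
LOG_IN = re.compile(r"log[ _-]?in", re.IGNORECASE)

def unique_before_match(pages, pattern=LOG_IN):
    seen = set()
    for page in pages:
        if pattern.search(str(page)):
            return len(seen)  # distinct pages seen before the first match
        seen.add(page)
    return None  # no page matched the pattern

result = unique_before_match(
    ["zoom.us/about_us", "zoom.us/about_us", "zoom.us/help", "zoom.us/Log-In/x"])
print(result)  # 2
```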