Find number of unique elements in a list column in DataFrame

Time:03-01

I have a DataFrame in which, for the column 'pages', I need to count the number of unique elements that appear before the first element containing the substring 'log_in'. If a list contains more than one such element, I need to count only up to the first one.

input example:

site pages
zoom.us ['zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344']
zoom.us ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye']

output example:

site pages unique_pages_before_log_in
zoom.us ['zoom.us/register', 'zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344'] 1
zoom.us ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye'] 3

I thought about using set to count unique values, but I don't know how to count only until the first 'log_in' substring appears. Something like this:

df['unique_pages_before_login'] = df['pages'].apply(lambda l: len(set(l[:l.index('zoom.us/log_in')])))

I will appreciate any help :)

CodePudding user response:

It looks like you need .apply() here. One approach is to add each element to a set until you reach one that contains your search string, then return the size of the set you've built.

def count_unique_before_login(pages):
    c = set()
    for item in pages:
        if "log_in" in item:
            return len(c)
        c.add(item)
    return None # No log_in found


import pandas as pd

# Build an actual DataFrame; a plain dict has no .apply()
df = pd.DataFrame({'site': {0: 'zoom.us', 1: 'zoom.us'},
 'pages': {0: ['zoom.us/register',
   'zoom.us/log_in/=?sdsd',
   'zoom.us/log_in/=a3344'],
  1: ['zoom.us/about_us',
   'zoom.us/error',
   'zoom.us/help',
   'zoom.us/log_in/jjjsl',
   'zoom.us/log_in/llaye']}})

df["unique_pages_before_log_in"] = df["pages"].apply(count_unique_before_login)

Which gives:

      site  ... unique_pages_before_log_in
0  zoom.us  ...                          1
1  zoom.us  ...                          3
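
A more compact variant of the same idea, as a sketch: itertools.takewhile yields pages until the first one that contains the search string, and a set counts the distinct ones. The names count_unique_until and marker are illustrative and not from the question. Note one difference from the function above: when no log_in page exists, this version returns the count of all unique pages instead of None.

```python
from itertools import takewhile


def count_unique_until(pages, marker="log_in"):
    """Count distinct pages that appear before the first page containing `marker`."""
    # takewhile stops at the first page that contains the marker
    before = takewhile(lambda page: marker not in page, pages)
    return len(set(before))


# The first row repeats 'zoom.us/register', as in the question's output example
rows = [
    ["zoom.us/register", "zoom.us/register", "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"],
    ["zoom.us/about_us", "zoom.us/error", "zoom.us/help", "zoom.us/log_in/jjjsl", "zoom.us/log_in/llaye"],
]
print([count_unique_until(row) for row in rows])  # [1, 3]
```

It can be passed to the column in the same way: df["pages"].apply(count_unique_until).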

CodePudding user response:

First, let's define a function that finds the first log_in page. It should count the unique pages (preserving order) that come before the first log_in instance.

def find_log_in(pages):
    # Duplicate removal while preserving order original idea from: https://stackoverflow.com/a/17016257/3281097
    # Python 3.7+ only
    for i, page in enumerate(dict.fromkeys(pages)):
        if page.startswith("zoom.us/log_in/"):
            return i
    return None  # -1 or any value that you prefer

Now, you just need to apply this function to your column:

df["unique_pages_before_log_in"] = df["pages"].apply(find_log_in)
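
To see why the index returned by the function equals the number of unique pages before the first log_in, here is a small sketch of what dict.fromkeys does to a list with a repeated page. The sample list is the first row from the question's output example; the variable names are illustrative.

```python
pages = ["zoom.us/register", "zoom.us/register", "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order (Python 3.7+)
deduped = list(dict.fromkeys(pages))
print(deduped)
# ['zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344']

# Position of the first log_in page in the de-duplicated list
first_log_in = next(i for i, page in enumerate(deduped) if page.startswith("zoom.us/log_in/"))
print(first_log_in)  # 1
```

Every entry before that position is a distinct page that is not a log_in page, so the position itself is the count.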

CodePudding user response:

Try:

df["unique_pages_before_log_in"] = df["pages"].apply(lambda x: len(set(x[:min(i for i, s in enumerate(x) if "log_in" in s)])))

>>> df
      site  ... unique_pages_before_log_in
0  zoom.us  ...                          1
1  zoom.us  ...                          3

[2 rows x 3 columns]
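
Note that the one-liner above raises a ValueError for any list without a log_in page, because min() receives an empty sequence. A minimal sketch of one way to guard against that, assuming None is an acceptable result for such rows (the name unique_before and the marker parameter are illustrative):

```python
def unique_before(pages, marker="log_in"):
    """Count distinct pages before the first page containing `marker`, or None if absent."""
    # next() with a default avoids the ValueError that min() raises on an empty sequence
    first = next((i for i, s in enumerate(pages) if marker in s), None)
    if first is None:
        return None
    return len(set(pages[:first]))


print(unique_before(["zoom.us/register", "zoom.us/log_in/=?sdsd"]))  # 1
print(unique_before(["zoom.us/about_us", "zoom.us/help"]))           # None
```

It can be used as df["pages"].apply(unique_before).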

CodePudding user response:

You can use re.findall and a for loop to get what you want.

import re

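def find_unique_elements(list_, matchword):
    unique_no = []
    for row in list_:
        for i in range(len(row)):
            # re.findall returns a non-empty list when the element contains matchword
            if re.findall(matchword, str(row[i])):
                # Count the distinct pages seen before the first match
                unique_no.append(len(set(row[:i])))
                break
        else:
            # No match in this row: append None so the result stays aligned with the DataFrame
            unique_no.append(None)

    return unique_no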

matchword = "log_in"
list_ = df["pages"]

ddf = find_unique_elements(list_,matchword)
df["unique_pages_before_log_in"] = ddf