I have website visitor data that resembles the example below:
id | pages |
---|---|
001 | /ice-cream, /bagels, /bagels/flavors |
002 | /pizza, /pizza/flavors, /pizza/recipe |
I would like to transform it into the tables below, so I can count the number of times each visitor has viewed a part of my site that deals with specific content. A general count of all pageviews (the pages field is a comma-delimited string) would be great as well.
id | bagel_count |
---|---|
001 | 2 |
002 | 0 |
id | pizza_count |
---|---|
001 | 0 |
002 | 3 |
id | total_pages_count |
---|---|
001 | 3 |
002 | 3 |
I have the option to do this in SQL or Python, but I am not sure which is easier, hence the question.
I have referenced the following question, but it is not serving my purpose:
Count the number of occurrences of a character in a string (this was close, but I am not sure how to apply it to a dataframe)
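For reference, here is roughly how the sample data above could be built as a DataFrame (assuming pages is stored as a single comma-separated string per visitor):

import pandas as pd

# Hypothetical reconstruction of the example data shown above
df = pd.DataFrame({
    'id': ['001', '002'],
    'pages': ['/ice-cream, /bagels, /bagels/flavors',
              '/pizza, /pizza/flavors, /pizza/recipe'],
})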
CodePudding user response:
We can do split, then explode, and get your result with crosstab:
# Split each pages string on '/', ',' and spaces (regex), explode to one
# token per row, then cross-tabulate id against token; drop the empty-string
# column produced by the leading '/' and the ', ' separators
df['pages'] = df.pages.str.split(r'[/, ]')
s = df.explode('pages')
out = pd.crosstab(s['id'], s['pages']).drop('', axis=1)
out
Out[427]:
pages  bagels  flavors  ice-cream  pizza  recipe
id
1           2        1          1      0       0
2           0        1          0      3       1
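If you only need the specific keyword columns from the question, they can be pulled from out afterwards; a rough sketch (the reindex guards against a keyword that never appears and so has no crosstab column):

# Keyword-specific counts, renamed to match the desired output
keyword_counts = (out.reindex(columns=['bagels', 'pizza'], fill_value=0)
                     .rename(columns={'bagels': 'bagel_count',
                                      'pizza': 'pizza_count'}))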
CodePudding user response:
I would go this route if you prefer SQL. I typically leave pivoting to reporting applications, but if you really insist, Snowflake has good documentation on pivoting, so you can take it from here.
with cte (id, pages) as
  (select '001', '/ice-cream, /bagels, /bagels/flavors' union all
   select '002', '/pizza, /pizza/flavors, /pizza/recipe')
select id,
       t2.value,
       count(*) as word_count,
       length(pages) - length(replace(pages, ',', '')) + 1 as user_page_count
from cte,
     lateral split_to_table(translate(cte.pages, '- ,', '/'), '/') as t2 -- normalize word delimiters using translate (similar to replace)
where t2.value in ('bagels', 'pizza') -- your list goes here
group by id, pages, t2.value;
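If you do want the pivoted shape directly in SQL, conditional aggregation is one way to sketch it; this is only an illustration building on the query above, and the output column names are assumptions, not a finished solution:

with cte (id, pages) as
  (select '001', '/ice-cream, /bagels, /bagels/flavors' union all
   select '002', '/pizza, /pizza/flavors, /pizza/recipe')
select id,
       sum(iff(t2.value = 'bagels', 1, 0)) as bagel_count,
       sum(iff(t2.value = 'pizza', 1, 0)) as pizza_count,
       length(pages) - length(replace(pages, ',', '')) + 1 as total_pages_count
from cte,
     lateral split_to_table(translate(cte.pages, '- ,', '/'), '/') as t2
group by id, pages;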
CodePudding user response:
I personally like to use regular expressions with groups, then expand the results into a DataFrame which I merge back into the main one. This has several advantages over the split method, mainly avoiding excessive memory usage, which can result in a dramatic performance improvement.
import re
from typing import List, Dict
import pandas as pd
my_words = [
    'bagels',
    'flavors',
    'ice-cream',
    'pizza',
    'recipe',
]
def count_words(string: str, words: List[str] = my_words) -> Dict[str, int]:
    """
    Returns a dictionary of summed counts
    for the selected words contained in string
    """
    # Create a dictionary to return values
    match_dict = {x: 0 for x in words}
    # Numbered capture groups with word boundaries
    # Note this will not allow pluralities, unless specified
    # Also: cache (or frontload) this value to improve performance
    my_regex_string = '|'.join(fr'\b({x})\b' for x in words)
    my_pattern = re.compile(my_regex_string)
    for match in my_pattern.finditer(string):
        value = match.group()
        match_dict[value] += 1
    return match_dict

# Create a new df with values from the function
new_df = df['pages'].apply(count_words).apply(pd.Series)
   bagels  flavors  ice-cream  pizza  recipe
0       2        1          1      0       0
1       0        1          0      3       1
# Merge back to the main df
df[['id']].merge(new_df, left_index=True, right_index=True)
   id  bagels  flavors  ice-cream  pizza  recipe
0   1       2        1          1      0       0
1   2       0        1          0      3       1
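As the comment in count_words suggests, compiling the pattern once outside the function avoids rebuilding it for every row; a minimal sketch of that variant, reusing my_words from above (count_words_cached is just an illustrative name):

# Compile once at module level and reuse across rows
MY_PATTERN = re.compile('|'.join(fr'\b({x})\b' for x in my_words))

def count_words_cached(string: str) -> Dict[str, int]:
    """Same counting logic as count_words, but with a precompiled pattern."""
    match_dict = {x: 0 for x in my_words}
    for match in MY_PATTERN.finditer(string):
        match_dict[match.group()] += 1
    return match_dict

new_df = df['pages'].apply(count_words_cached).apply(pd.Series)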
CodePudding user response:
Going to mark @BENY's answer correct because of its elegance, but I found a way to do this in Python, focusing on a specific keyword. Assume df looks like my original table:
df['bagel_count'] = df["pages"].str.count('bagel')
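For the total pageview count from the question, the same string-method approach works, assuming pages really is one comma-separated string per row with at least one page:

# Total pages visited = number of comma separators + 1
df['total_pages_count'] = df['pages'].str.count(',') + 1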