How to use multiple string conditions and numerical calculations on multiple columns to create multi-CodePudding

Input:

(Having error in uploading image, otherwise I always do.)

import pandas as pd

df = pd.DataFrame(
    {
     'keyword': ['app store', 'app marketing', 'app store', 'app marketing'],
     'rank': [10, 12, 10, 12],
     'impression': [100, 200, 100, 200],
     'landing page': ['ngl.com/123', 'ngl.com/123', 'oats.com/123', 'oats.com/123']
    }
)

df

Output:

import pandas as pd

df = pd.DataFrame(
    {
     'keyword': ['app', 'store', 'marketing', 'app', 'store', 'marketing'],
     'mean_rank': [11, 10, 12, 11, 10, 12],
     'impression': [300, 100, 200, 300, 100, 200],
     'landing page': ['ngl.com/123', 'ngl.com/123', 'ngl.com/123', 'oats.com/123', 'oats.com/123', 'oats.com/123'],
     'keyword_length':[3, 5, 9, 3, 5, 9],
     'impression_per_char':[75, 16.67, 20, 75, 16.67, 20]
    }
)

df

Maybe this could be used to find words in keyword:

words = 'app store'
re.findall('\w ', words.casefold())

mean_rank = Mean rank of the word in keyword.

keyword_length = length of the word in keyword.

impression_per_char = Impression/(keyword_length 1)

Actual dataset has 10,000 rows. This one is made by me, please tell if something is wrong with it. I'll be parallelly working on this for the next few hours.

Also, for 'mean_rank' column, you can take weighted mean or some made up equation that (maybe also) uses 'impression', 'keyword_length' and/or 'impression_per_char' to find a sensible rank. If you do so, then I'll select that as final answer instead.

CodePudding user response：

Use Series.str.casefold with Series.str.split and DataFrame.explode for separate words, get legths of words by Series.str.len, then aggregate sum and mean and last create impression_per_char column:

df = df.assign(keyword = df['keyword'].str.casefold().str.split()).explode('keyword')
df['keyword_length'] = df['keyword'].str.len()
    
df = (df.groupby(['keyword','landing page', 'keyword_length' ], as_index=False, sort=False)
        .agg(mean_rank=('rank','mean'), impression=('impression', 'sum')))

df['impression_per_char'] = df['impression'].div(df['keyword_length'].add(1))
print (df)
     keyword  landing page  keyword_length  mean_rank  impression  \
0        app   ngl.com/123               3         11         300   
1      store   ngl.com/123               5         10         100   
2  marketing   ngl.com/123               9         12         200   
3        app  oats.com/123               3         11         300   
4      store  oats.com/123               5         10         100   
5  marketing  oats.com/123               9         12         200   

   impression_per_char  
0            75.000000  
1            16.666667  
2            20.000000  
3            75.000000  
4            16.666667  
5            20.000000