Trying to add prefixes to url if not present in pandas df column

Time:02-19

I am trying to add prefixes to the URLs in my 'Website' column. I can't figure out how to keep each new assignment to the helper column from overwriting everything set by the previous one.

For example, say I have the following URLs in my column:

http://www.bakkersfinedrycleaning.com/
www.cbgi.org
barstoolsand.com

This would be the desired end state:

http://www.bakkersfinedrycleaning.com/
http://www.cbgi.org
http://www.barstoolsand.com

This is as close as I have been able to get:

def nan_to_zeros(df, col):
    new_col = f"nanreplace{col}"
    df[new_col] = df[col].fillna('~')
    return df

df1 = nan_to_zeros(df1, 'Website')
df1['url_helper'] = df1.loc[~df1['nanreplaceWebsite'].str.startswith('http')| ~df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'https://www.' 
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('http'), 'url_helper'] = ""
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('www'),'url_helper'] = 'www'


print(df1[['nanreplaceWebsite',"url_helper"]])

This just gives me a helper column of all 'www', because the last assignment overwrites every row. Any direction appreciated.

Data:

{'Website': ['http://www.bakkersfinedrycleaning.com/', 
             'www.cbgi.org', 'barstoolsand.com']}

CodePudding user response:

IIUC, there are 3 things to fix here:

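  1. The leading df1['url_helper'] = shouldn't be there. In a chained assignment, the value on the right is also assigned to the whole column, so each line overwrites every row.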

  2. | should be & in the first condition, because 'https://www.' should be added only to URLs that start with neither of the strings in the condition. With |, the condition is true for every URL that lacks at least one of the prefixes, which is all of them. The error becomes apparent if the first assignment is run after the other two.

  3. The last condition should add "http://" instead of "www".
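
The three fixes above can be sketched as follows. This is only a minimal illustration built on the sample data from the question; it assumes 'http://www.' (rather than 'https://www.') as the prefix for bare domains so that the result matches the desired end state, and the has_http / has_www names are introduced here for readability.

```python
import pandas as pd

df1 = pd.DataFrame({'Website': ['http://www.bakkersfinedrycleaning.com/',
                                'www.cbgi.org', 'barstoolsand.com']})

# Same placeholder as in the question, so the .str methods never see NaN
s = df1['Website'].fillna('~')
has_http = s.str.startswith('http')  # hypothetical helper name
has_www = s.str.startswith('www')    # hypothetical helper name

# Fix 1: assign through .loc only, without a leading "df1['url_helper'] ="
# Fix 2: combine the negated conditions with & (neither prefix is present)
df1.loc[~has_http & ~has_www, 'url_helper'] = 'http://www.'
df1.loc[has_http, 'url_helper'] = ''
# Fix 3: a URL that already starts with "www" only needs the scheme
df1.loc[has_www, 'url_helper'] = 'http://'

# Prepend the helper prefix to build the final URL
df1['fixed Website'] = df1['url_helper'] + s
print(df1[['Website', 'fixed Website']])
```

Because each .loc assignment now touches only the rows selected by its mask, the order of the three assignments no longer matters.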

Alternatively, your problem could be solved using np.select. Pass a list of conditions and a list of corresponding choices, plus a default for rows that match none of the conditions:

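import numpy as np

# Placeholder for missing values so the .str methods always return booleans
s = df1['Website'].fillna('~')
df1['fixed Website'] = np.select(
    [~(s.str.startswith('http') | ~s.str.contains('www')),  # no http, has www
     ~(s.str.startswith('http') | s.str.contains('www'))],  # no http, no www
    ['http://' + s, 'http://www.' + s],
    s)  # default: URL already starts with http, keep it unchanged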

Output:

                                  Website                            fixed Website
0  http://www.bakkersfinedrycleaning.com/   http://www.bakkersfinedrycleaning.com/
1                            www.cbgi.org                      http://www.cbgi.org
2                        barstoolsand.com              http://www.barstoolsand.com