Hi everyone! I wrote a function to replace NaN values in a DataFrame (1,750,000 rows):
def Android_iOs_device_os_cange(df):
    def find_Android_brand(df):
        list_for_android = list(df[df['device_os'] == 'Android'].device_brand.unique())
        list_for_android.remove('(not set)')
        return list_for_android
    def find_iOS_brand(df):
        list_for_iOS = list(df[df['device_os'] == 'iOS'].device_brand.unique())
        list_for_iOS.remove('(not set)')
        return list_for_iOS
    for i in list(df[df.device_os.isnull() & df.device_brand.notnull()].index):
        if df.device_brand[i] in find_Android_brand(df) and pd.isnull(df.loc[i, 'device_os']) == True:
            df['device_os'][i] = df.loc[i, 'device_os'] = 'Android'
        elif df.device_brand[i] in find_iOS_brand(df) and pd.isnull(df.loc[i, 'device_os']) == True:
            df['device_os'][i] = df.loc[i, 'device_os'] = 'iOS'
        else:
            df['device_os'][i] = df.loc[i, 'device_os'] = '(not set)'
    return df
It does its job, but it replaced only 20,000 rows in 3.5 hours. I understand that the bottleneck is the for loop, but I don't see how to make the function faster. Can anyone advise?
I tried to do it with loc, but for me it always ended with
'Series' object has no attribute 'device_os'
CodePudding user response:
Try this:
import numpy as np

def Android_iOs_device_os_cange(df):
    # Build each brand list once instead of on every loop iteration
    list_for_android = list(df[df['device_os'] == 'Android'].device_brand.unique())
    list_for_android.remove('(not set)')
    list_for_iOS = list(df[df['device_os'] == 'iOS'].device_brand.unique())
    list_for_iOS.remove('(not set)')
    # Fill missing device_os for known iOS brands, then for known Android brands
    df['device_os'] = np.where((df['device_brand'].isin(list_for_iOS)) & (df['device_os'].isnull()), 'iOS', df['device_os'])
    df['device_os'] = np.where((df['device_brand'].isin(list_for_android)) & (df['device_os'].isnull()), 'Android', df['device_os'])
    return df
The changes I made:
1) The lists do not need helper functions. Calling those functions on every loop iteration rebuilds the same lists over and over, which is time-consuming.
2) Used np.where. Loops are a last resort in pandas.
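Since the question mentions trying loc: here is a rough sketch of the same idea using boolean masks with df.loc. It also keeps the else branch of the original loop, which sets '(not set)' when the brand matches neither list. The function name fill_device_os and the sample data are made up for illustration, and it assumes the column names from the question. Filtering '(not set)' out with a mask avoids the ValueError that list.remove raises when the value is absent.

```python
import pandas as pd

def fill_device_os(df):
    # Hypothetical helper: fills missing device_os based on device_brand.
    df = df.copy()  # leave the caller's DataFrame untouched

    # Brands usable for matching: present and not the '(not set)' placeholder
    known = df['device_brand'].notna() & (df['device_brand'] != '(not set)')
    android_brands = df.loc[(df['device_os'] == 'Android') & known, 'device_brand'].unique()
    ios_brands = df.loc[(df['device_os'] == 'iOS') & known, 'device_brand'].unique()

    # Rows the original loop visited: OS missing, brand present
    to_fill = df['device_os'].isna() & df['device_brand'].notna()
    is_android = df['device_brand'].isin(android_brands)
    is_ios = df['device_brand'].isin(ios_brands)

    # Same priority as the loop: Android first, then iOS, then '(not set)'
    df.loc[to_fill & is_android, 'device_os'] = 'Android'
    df.loc[to_fill & ~is_android & is_ios, 'device_os'] = 'iOS'
    df.loc[to_fill & ~is_android & ~is_ios, 'device_os'] = '(not set)'
    return df

# Made-up sample data, only to show the call
df = pd.DataFrame({
    'device_os':    ['Android', 'iOS', None, None, None, None, 'Android'],
    'device_brand': ['Samsung', 'Apple', 'Samsung', 'Apple', 'Nokia', None, '(not set)'],
})
out = fill_device_os(df)
print(out)
```

Each mask is computed once over the whole column, so there is no Python-level loop over rows. Rows where device_brand is also missing are left as NaN, as in the original function.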