pd.fillna or pd.DataFrame.loc for NaN values by function with condition

Hi everyone, I wrote a function to replace NaN values in a DataFrame (1,750,000 rows):

def Android_iOs_device_os_cange(df):
    def find_Android_brand(df):
        list_for_android = list(df[df['device_os'] == 'Android'].device_brand.unique())
        list_for_android.remove('(not set)')
        return list_for_android
    def find_iOS_brand(df):
        list_for_iOS = list(df[df['device_os'] == 'iOS'].device_brand.unique())
        list_for_iOS.remove('(not set)')
        return list_for_iOS
    for i in list(df[df.device_os.isnull() & df.device_brand.notnull()].index):
        if df.device_brand[i] in find_Android_brand(df) and pd.isnull(df.loc[i, 'device_os']) == True:
            df['device_os'][i] = df.loc[i, 'device_os'] = 'Android'
        elif df.device_brand[i] in find_iOS_brand(df) and pd.isnull(df.loc[i, 'device_os']) == True:
            df['device_os'][i] = df.loc[i, 'device_os'] = 'iOS'
        else:
            df['device_os'][i] = df.loc[i, 'device_os'] = '(not set)'
    return df

It does what it should, but it filled only 20,000 rows in 3.5 hours. I understand that the bottleneck is the for loop, but I don't see how to make the function faster. Can anyone advise?

I tried to do it with loc, but for me it always ended with

'Series' object has no attribute 'device_os'

CodePudding user response：

Try this:

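import numpy as np
def Android_iOs_device_os_cange(df):
    # Build the brand lists once; skip NaN brands and the '(not set)' placeholder
    # (list.remove would raise ValueError if '(not set)' were absent).
    known = df['device_brand'].notnull() & (df['device_brand'] != '(not set)')
    list_for_android = df.loc[known & (df['device_os'] == 'Android'), 'device_brand'].unique()
    list_for_iOS = df.loc[known & (df['device_os'] == 'iOS'), 'device_brand'].unique()

    # Same rows the original loop visited: OS missing, brand present.
    to_fill = df['device_os'].isnull() & df['device_brand'].notnull()
    is_android = to_fill & df['device_brand'].isin(list_for_android)
    is_iOS = to_fill & ~is_android & df['device_brand'].isin(list_for_iOS)

    df['device_os'] = np.where(is_android, 'Android', df['device_os'])
    df['device_os'] = np.where(is_iOS, 'iOS', df['device_os'])
    # Fallback from the else branch of the original loop.
    df['device_os'] = np.where(to_fill & ~is_android & ~is_iOS, '(not set)', df['device_os'])
    return df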

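The changes I made: 1) The brand lists do not need helper functions. Calling those functions inside the loop rebuilt both lists for every row, which is where most of the time went; now each list is built once. 2) I used np.where, which evaluates the condition on whole columns at once. Loops are a last resort in pandas.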
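
Since the question also asks about loc: the same idea can be written with boolean masks and DataFrame.loc, which assigns to every matching row in one step. The following is only a minimal sketch under assumptions. The function name fill_device_os and the sample data are made up for illustration, and the last assignment mirrors the '(not set)' fallback from the else branch of the loop in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    "device_brand": ["Samsung", "Apple", "Samsung", "Apple", "Nokia", np.nan],
    "device_os": ["Android", "iOS", np.nan, np.nan, np.nan, np.nan],
})

def fill_device_os(df):
    """Fill missing device_os from device_brand using boolean masks and .loc."""
    df = df.copy()  # leave the caller's frame untouched

    # Brands already seen with each OS, ignoring NaN and the '(not set)' placeholder.
    known = df["device_brand"].notna() & (df["device_brand"] != "(not set)")
    android_brands = df.loc[known & (df["device_os"] == "Android"), "device_brand"].unique()
    ios_brands = df.loc[known & (df["device_os"] == "iOS"), "device_brand"].unique()

    # Rows to fill: OS missing, brand present (same rows the loop iterated over).
    to_fill = df["device_os"].isna() & df["device_brand"].notna()
    is_android = to_fill & df["device_brand"].isin(android_brands)
    is_ios = to_fill & ~is_android & df["device_brand"].isin(ios_brands)

    # All masks are computed before any assignment, so the order below does not matter.
    df.loc[is_android, "device_os"] = "Android"
    df.loc[is_ios, "device_os"] = "iOS"
    df.loc[to_fill & ~is_android & ~is_ios, "device_os"] = "(not set)"
    return df

result = fill_device_os(df)
print(result)
```

Rows where device_brand is itself NaN are left untouched here, because the loop in the question skipped them as well.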