How do you iterate over a pandas table and replace nulls with a unique or incremental number?-CodePudding

I have a pandas dataframe that has most of column Job_Code as unique text, however there are missing records recorded as nulls (ie "NaN"). To push this into a Database with Job_Code as a Primary Key I want to replace these missing records with new unique values. My current df is 265 rows × 41 columns. For simplicity:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})

My desired outcome is:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', 'tempkey1', 'J003', 'tempkey2']})

My initial code to solve executed, but didn't work:

c=1
for e in enumerate(df.Job_Code):
     if pd.isnull(e):
        e == "tempkey"   str(c)
         c 1

I was looking and couldn't find a solution to my problem, though more than once there was a case of "Don't use iterrows!".

Feel free to comment on any methods I've used.

CodePudding user response：

try this:

mask = df['Job_Code'].isna()
tempkey = (df['Job_Code'].groupby(mask).cumcount()   1).map('tempkey{}'.format)
df['Job_Code'] = df['Job_Code'].fillna(tempkey)
print(df)
>>>

    Client          Job_Code
0   Microsoft       J001
1   Google          J002
2   Apple           tempkey1
3   StackOverflow   J003
4   PostGres        tempkey2

CodePudding user response：

You can create mask by test missing values by Series.isna and then in DataFrame.loc add range by RangeIndex with count number of Trues by sum:

m = df['Job_Code'].isna()
df.loc[m, 'Job_Code'] = 'tempkey'   pd.RangeIndex(1,m.sum() 1).astype(str)
print (df)
          Client  Job_Code
0      Microsoft      J001
1         Google      J002
2          Apple  tempkey1
3  StackOverflow      J003
4       PostGres  tempkey2

Performance:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})

#50000 rows
df = pd.concat([df] * 10000, ignore_index=True)
print (df)


In [23]: %%timeit
    ...: c=1
    ...: for i, row in df.iterrows():
    ...:     if pd.isnull(row['Job_Code']):
    ...:         df.at[i,'Job_Code'] = "tempkey"   str(c)
    ...:         c= c 1
    ...:         
4.35 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %%timeit
    ...: mask = df['Job_Code'].isna()
    ...: tempkey = (df['Job_Code'].groupby(mask).cumcount()   1).map('tempkey{}'.format)
    ...: df['Job_Code'] = df['Job_Code'].fillna(tempkey)
    ...: 
28.1 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: %%timeit
    ...: m = df['Job_Code'].isna()
    ...: df.loc[m, 'Job_Code'] = 'tempkey'   pd.RangeIndex(1,m.sum() 1).astype(str)
    ...: 
2.86 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response：

In the end I used this and it worked, but I'd accept a better answer, one that isn't iterrows reliant.

c=1
for i, row in df.iterrows():
    if pd.isnull(row['Job_Code']):
        df.at[i,'Job_Code'] = "tempkey"   str(c)
        c= c 1