I have a pandas dataframe that has most of column Job_Code as unique text, however there are missing records recorded as nulls (ie "NaN"). To push this into a Database with Job_Code as a Primary Key I want to replace these missing records with new unique values. My current df is 265 rows × 41 columns. For simplicity:
df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})
My desired outcome is:
df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
'Job_Code': ['J001', 'J002', 'tempkey1', 'J003', 'tempkey2']})
My initial code to solve executed, but didn't work:
c=1
for e in enumerate(df.Job_Code):
if pd.isnull(e):
e == "tempkey" str(c)
c 1
I was looking and couldn't find a solution to my problem, though more than once there was a case of "Don't use iterrows!".
Feel free to comment on any methods I've used.
CodePudding user response:
try this:
mask = df['Job_Code'].isna()
tempkey = (df['Job_Code'].groupby(mask).cumcount() 1).map('tempkey{}'.format)
df['Job_Code'] = df['Job_Code'].fillna(tempkey)
print(df)
>>>
Client Job_Code
0 Microsoft J001
1 Google J002
2 Apple tempkey1
3 StackOverflow J003
4 PostGres tempkey2
CodePudding user response:
You can create mask by test missing values by Series.isna
and then in DataFrame.loc
add range
by RangeIndex
with count number of True
s by sum
:
m = df['Job_Code'].isna()
df.loc[m, 'Job_Code'] = 'tempkey' pd.RangeIndex(1,m.sum() 1).astype(str)
print (df)
Client Job_Code
0 Microsoft J001
1 Google J002
2 Apple tempkey1
3 StackOverflow J003
4 PostGres tempkey2
Performance:
df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})
#50000 rows
df = pd.concat([df] * 10000, ignore_index=True)
print (df)
In [23]: %%timeit
...: c=1
...: for i, row in df.iterrows():
...: if pd.isnull(row['Job_Code']):
...: df.at[i,'Job_Code'] = "tempkey" str(c)
...: c= c 1
...:
4.35 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [24]: %%timeit
...: mask = df['Job_Code'].isna()
...: tempkey = (df['Job_Code'].groupby(mask).cumcount() 1).map('tempkey{}'.format)
...: df['Job_Code'] = df['Job_Code'].fillna(tempkey)
...:
28.1 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [25]: %%timeit
...: m = df['Job_Code'].isna()
...: df.loc[m, 'Job_Code'] = 'tempkey' pd.RangeIndex(1,m.sum() 1).astype(str)
...:
2.86 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
In the end I used this and it worked, but I'd accept a better answer, one that isn't iterrows reliant.
c=1
for i, row in df.iterrows():
if pd.isnull(row['Job_Code']):
df.at[i,'Job_Code'] = "tempkey" str(c)
c= c 1