Home > Mobile >  In a Pandas Id column, how can I insert an incremented value (max 1) for each missing Id value
In a Pandas Id column, how can I insert an incremented value (max 1) for each missing Id value

Time:07-13

In a large data set (300K rows) with an Id column (primary key), I append new rows of data. New rows do not have an Id, which I am having difficulty adding. Each new Id should be an incremented value for each row - adding 1 to the maximum value in the Id column.

The data looks something like this:

import pandas as pd
# example data frame
inp = [{'Id': 0, 'Col1': 1, 'Col2': 7},
   {'Id': 1, 'Col1': 1, 'Col2': 8},
   {'Id': 2, 'Col1': 3, 'Col2': 9},
   {'Id': '', 'Col1': 1, 'Col2': 10}, 
   {'Id': 4, 'Col1': 5, 'Col2': 11},
   {'Id': '', 'Col1': 1, 'Col2': 12}
   ]
df = pd.DataFrame(inp)
# format to be like my real data
df["Id"] = pd.to_numeric(df["Id"], errors='coerce')
df["Id"] = df["Id"].astype("Int64")

print(df)
      Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3  <NA>     1    10
4     4     5    11
5  <NA>     1    12

Needed:

      Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3     5     1    10
4     4     5    11
5     6     1    12

Fails:

df["Id"] = np.select([df["Id"].isna()], [df["Id"].max()   1], default=df["Id"])
print(df
          Id  Col1  Col2
0     0     1     7
1     1     1     8
2     2     3     9
3     5     1    10
4     4     5    11
5     5     1    12

df["Id"] = df.apply(lambda x: df["Id"].max()   1 if ~isinstance(x["Id"], int) else x["Id"], axis=1)
   Id  Col1  Col2
0   5     1     7
1   5     1     8
2   5     3     9
3   5     1    10
4   5     5    11
5   5     1    12

df.sort_values(by=["Id"], inplace=True)
df.set_index("Id", inplace=True)
      Col1  Col2
Id              
0        1     7
1        1     8
2        3     9
4        5    11
<NA>     1    10
<NA>     1    12

I'm not sure if I'm even on the right track (this seems like such an obvious thing to do, but I can't find it described from the perspective I'm taking). Lambda and looping techniques were also very slow. Am I missing some simple function to do exactly this?!?

CodePudding user response:

Try:

  • add max() to cumsum on isna
  • fillna with the above

Code:

df['Id'] = df['Id'].fillna(df['Id'].isna().cumsum() df['Id'].max())

Output:

   Id  Col1  Col2
0   0     1     7
1   1     1     8
2   2     3     9
3   5     1    10
4   4     5    11
5   6     1    12
  • Related