In a large data set (300K rows) with an Id column (primary key), I append new rows of data. New rows do not have an Id, which I am having difficulty adding. Each new Id should be an incremented value for each row - adding 1 to the maximum value in the Id column.
The data looks something like this:
import pandas as pd
# example data frame
inp = [{'Id': 0, 'Col1': 1, 'Col2': 7},
{'Id': 1, 'Col1': 1, 'Col2': 8},
{'Id': 2, 'Col1': 3, 'Col2': 9},
{'Id': '', 'Col1': 1, 'Col2': 10},
{'Id': 4, 'Col1': 5, 'Col2': 11},
{'Id': '', 'Col1': 1, 'Col2': 12}
]
df = pd.DataFrame(inp)
# format to be like my real data
df["Id"] = pd.to_numeric(df["Id"], errors='coerce')
df["Id"] = df["Id"].astype("Int64")
print(df)
Id Col1 Col2
0 0 1 7
1 1 1 8
2 2 3 9
3 <NA> 1 10
4 4 5 11
5 <NA> 1 12
Needed:
Id Col1 Col2
0 0 1 7
1 1 1 8
2 2 3 9
3 5 1 10
4 4 5 11
5 6 1 12
Fails:
df["Id"] = np.select([df["Id"].isna()], [df["Id"].max() 1], default=df["Id"])
print(df
Id Col1 Col2
0 0 1 7
1 1 1 8
2 2 3 9
3 5 1 10
4 4 5 11
5 5 1 12
df["Id"] = df.apply(lambda x: df["Id"].max() 1 if ~isinstance(x["Id"], int) else x["Id"], axis=1)
Id Col1 Col2
0 5 1 7
1 5 1 8
2 5 3 9
3 5 1 10
4 5 5 11
5 5 1 12
df.sort_values(by=["Id"], inplace=True)
df.set_index("Id", inplace=True)
Col1 Col2
Id
0 1 7
1 1 8
2 3 9
4 5 11
<NA> 1 10
<NA> 1 12
I'm not sure if I'm even on the right track (this seems like such an obvious thing to do, but I can't find it described from the perspective I'm taking). Lambda and looping techniques were also very slow. Am I missing some simple function to do exactly this?!?
CodePudding user response:
Try:
- add
max()
tocumsum
onisna
fillna
with the above
Code:
df['Id'] = df['Id'].fillna(df['Id'].isna().cumsum() df['Id'].max())
Output:
Id Col1 Col2
0 0 1 7
1 1 1 8
2 2 3 9
3 5 1 10
4 4 5 11
5 6 1 12