What I am looking for is to assign to each ID its most current description (as long as it is not an empty cell; if it is empty, it should fall back to the first non-empty description). I have sorted the DataFrame by ID and by date, so within each ID "group" the first description is the most current one.
The problem comes when I have to take that description and propagate it to the rest of the rows with the same ID. Doing this with a FOR loop takes me more than 30 minutes, so I need a much more efficient solution.
So far, my procedure has been:
- List the unique IDs.
- Iterate over those IDs with a loop and, with a .loc, extract the description field.
- If the most recent description is empty, use an if to fall back to the second description:
for id in list(df.columnaid.unique()):
    if df.loc[df.columnaid == id].description.unique()[0] != "":
        description = df.loc[df.columnaid == id].description.unique()[0]
    elif df.loc[df.columnaid == id].description.unique()[0] == "" and len(df.loc[df.columnaid == id].description.unique()) > 1:
        description = df.loc[df.columnaid == id].description.unique()[1]
- Save the product ID and the description in a dictionary:
dicc[id] = dicc.get(id, description)
Then, with a .loc, an .isin and a .map, I replace the obtained values in the description column:
df.loc[df['columnaid'].isin(dicc.keys()), 'description'] = df['columnaid'].map(dicc)
This procedure works, but it is not optimal at all, and I need to know how it could be done in a better way so that it does not take more than 30 minutes.
An example of the DataFrame (it would be the same but with millions of rows) is:
df = pd.DataFrame({"columnaid": ["2321fdsf", "2321fdsf", "3gsdfer3", "4gdsfg44", "4gdsfg44", "4gdsfg44", "7fg45d"],
"date": ["2022-11-16","2022-10-07","2022-09-02","2021-12-04","2021-09-23","2021-03-06","2021-03-15"],
"description": ["aaa", "bbb", "abc", "eee", "", "aqwert", "yuiop"],
})
columnaid date description
0 2321fdsf 2022-11-16 aaa
1 2321fdsf 2022-10-07 bbb
2 3gsdfer3 2022-09-02 abc
3 4gdsfg44 2021-12-04 eee
4 4gdsfg44 2021-09-23
5 4gdsfg44 2021-03-06 aqwert
6 7fg45d 2021-03-15 yuiop
The outcome should be:
columnaid date description
0 2321fdsf 2022-11-16 aaa
1 2321fdsf 2022-10-07 aaa
2 3gsdfer3 2022-09-02 abc
3 4gdsfg44 2021-12-04 eee
4 4gdsfg44 2021-09-23 eee
5 4gdsfg44 2021-03-06 eee
6 7fg45d 2021-03-15 yuiop
Thank you
CodePudding user response:
Sure thing – use groupby:
import pandas as pd
df = pd.DataFrame(
{
"columnaid": ["2321fdsf", "2321fdsf", "3gsdfer3", "4gdsfg44", "4gdsfg44", "4gdsfg44", "7fg45d"],
"date": ["2022-11-16", "2022-10-07", "2022-09-02", "2021-12-04", "2021-09-23", "2021-03-06", "2021-03-15"],
"description": ["aaa", "bbb", "abc", "eee", "", "aqwert", "yuiop"],
}
)
# Convert date so we can `idxmax` it
df["date"] = pd.to_datetime(df["date"])
# Find newest descriptions per columnaid into an indexed series
newest_descriptions = df.groupby("columnaid").apply(lambda x: x.loc[x["date"].idxmax(), "description"])
# (Print for debugging)
print(newest_descriptions)
# Map the descriptions back into the original df
df["description"] = df["columnaid"].map(newest_descriptions)
print(df)
This prints out
columnaid
2321fdsf aaa
3gsdfer3 abc
4gdsfg44 eee
7fg45d yuiop
dtype: object
columnaid date description
0 2321fdsf 2022-11-16 aaa
1 2321fdsf 2022-10-07 aaa
2 3gsdfer3 2022-09-02 abc
3 4gdsfg44 2021-12-04 eee
4 4gdsfg44 2021-09-23 eee
5 4gdsfg44 2021-03-06 eee
6 7fg45d 2021-03-15 yuiop
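One caveat, in case the empty-cell fallback from the question matters: idxmax picks whatever description is newest, even if that cell is an empty string. A minimal variant that skips empty descriptions (assuming, as in the example, they are stored as "" rather than NaN) could be run in place of the mapping step above:
import numpy as np
# treat empty strings as missing, sort newest-first, then take the first non-null description per id
tmp = df.assign(description=df["description"].replace("", np.nan))
newest_non_empty = (tmp.sort_values("date", ascending=False)
                       .groupby("columnaid")["description"]
                       .first())  # first() skips nulls, so it falls back to the next non-empty row
df["description"] = df["columnaid"].map(newest_non_empty)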
CodePudding user response:
Unless I misunderstand, or you are overthinking this, your question can be done in three simple steps.
# drop empty/missing descriptions (empty cells here are "" rather than NaN, so turn them into NA first)
dfn = df.replace({'description': {'': pd.NA}}).dropna(subset=['description']).copy()
# now that you have a clean `dfn`, simply sort newest-first and pick up the 1st observation in each group
uniq = (dfn.sort_values('date', ascending=False)
           .groupby('columnaid')['description'].first()
           .rename('first_obs').reset_index())
# now you can merge uniq back to df
df = pd.merge(df, uniq, how='left', on='columnaid')
The column first_obs is your expected output.
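If you then want the result back in the original description column, one possible final step (just a sketch) is to overwrite it with first_obs and drop the helper column:
df['description'] = df['first_obs']
df = df.drop(columns='first_obs')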