Iterate millions of rows in pandas optimally


What I am looking for is to put, for each ID, the most current description (as long as it is not an empty cell; if it is empty, it should be the first non-empty description). I have sorted the DF by ID and by date, so within each ID "group" the first description is the most current.

The problem comes when I have to take that description and write it into the rest of the rows of the same ID. The process with a for loop takes me more than 30 minutes, so I need a much more efficient solution.

So far, my procedure has been:

  • List the unique IDs.
  • Iterate over those IDs and, with .loc, pull out the description field.
  • If the most recent description is empty, fall back to the second description:
for uid in df.columnaid.unique():
    # all distinct descriptions for this ID, newest first
    descriptions = df.loc[df.columnaid == uid, "description"].unique()

    if descriptions[0] != "":
        description = descriptions[0]
    elif descriptions[0] == "" and len(descriptions) > 1:
        description = descriptions[1]
        
  • Save the ID and the description in a dictionary:
dicc[uid] = dicc.get(uid, description)

Then, with a .loc, an .isin and a map, I replace the values in the description column. This procedure works, but it is not optimal at all, and I need to know how it could be done in a better way without taking more than 30 minutes.

# overwrite description for every ID in the dictionary with its stored value
df.loc[df['columnaid'].isin(dicc.keys()), 'description'] = df['columnaid'].map(dicc)

An example of the DataFrame (it would be the same but with millions of rows) is:

df = pd.DataFrame({"columnaid": ["2321fdsf", "2321fdsf", "3gsdfer3", "4gdsfg44", "4gdsfg44", "4gdsfg44", "7fg45d"],
                 "date": ["2022-11-16","2022-10-07","2022-09-02","2021-12-04","2021-09-23","2021-03-06","2021-03-15"],
                 "description": ["aaa", "bbb", "abc", "eee", "", "aqwert", "yuiop"],
                 })



    columnaid      date     description
0   2321fdsf    2022-11-16  aaa
1   2321fdsf    2022-10-07  bbb
2   3gsdfer3    2022-09-02  abc
3   4gdsfg44    2021-12-04  eee
4   4gdsfg44    2021-09-23  
5   4gdsfg44    2021-03-06  aqwert
6   7fg45d      2021-03-15  yuiop

The outcome should be:

    columnaid      date     description
0   2321fdsf    2022-11-16  aaa
1   2321fdsf    2022-10-07  aaa
2   3gsdfer3    2022-09-02  abc
3   4gdsfg44    2021-12-04  eee
4   4gdsfg44    2021-09-23  eee
5   4gdsfg44    2021-03-06  eee
6   7fg45d      2021-03-15  yuiop

Thank you

CodePudding user response:

Sure thing – use groupby:

import pandas as pd

df = pd.DataFrame(
    {
        "columnaid": ["2321fdsf", "2321fdsf", "3gsdfer3", "4gdsfg44", "4gdsfg44", "4gdsfg44", "7fg45d"],
        "date": ["2022-11-16", "2022-10-07", "2022-09-02", "2021-12-04", "2021-09-23", "2021-03-06", "2021-03-15"],
        "description": ["aaa", "bbb", "abc", "eee", "", "aqwert", "yuiop"],
    }
)

# Convert date so we can `idxmax` it
df["date"] = pd.to_datetime(df["date"])

# Find newest descriptions per columnaid into an indexed series
newest_descriptions = df.groupby("columnaid").apply(lambda x: x.loc[x["date"].idxmax(), "description"])
# (Print for debugging)
print(newest_descriptions)

# Map the descriptions back into the original df
df["description"] = df["columnaid"].map(newest_descriptions)

print(df)

This prints out

columnaid
2321fdsf      aaa
3gsdfer3      abc
4gdsfg44      eee
7fg45d      yuiop
dtype: object

  columnaid       date description
0  2321fdsf 2022-11-16         aaa
1  2321fdsf 2022-10-07         aaa
2  3gsdfer3 2022-09-02         abc
3  4gdsfg44 2021-12-04         eee
4  4gdsfg44 2021-09-23         eee
5  4gdsfg44 2021-03-06         eee
6    7fg45d 2021-03-15       yuiop
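
Note that idxmax takes the newest description even when it is an empty string; it only matches the expected output above because the blank cell is never the newest row in this sample. Here is a sketch that also honors the empty-string fallback and avoids the per-group Python lambda (which can itself get slow with millions of groups), starting again from the original df, before description is overwritten:

import numpy as np

# sort newest-first within each ID
out = df.sort_values(["columnaid", "date"], ascending=[True, False]).copy()

# treat empty cells as missing so they are skipped by "first"
out["description"] = out["description"].replace("", np.nan)

# transform("first") broadcasts each group's first non-null description,
# i.e. the newest non-empty one, to every row of that group
out["description"] = out.groupby("columnaid")["description"].transform("first")

# restore the original row order
out = out.sort_index()
print(out)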

CodePudding user response:

Unless I misunderstand, or you are overthinking this, your question can be done in three simple steps.

# treat empty strings as missing, then drop rows with no description
dfn = df.replace("", pd.NA).dropna(subset=['description']).copy()

# now you have a clean `dfn`; it is already sorted newest-first within each ID,
# so simply pick up the 1st observation in each group
uniq = dfn.groupby('columnaid')['description'].first().reset_index(name='first_obs')

# now you can merge uniq back to df
df = pd.merge(df, uniq, how='left', on='columnaid')

The first_obs column is your expected output; if you want the original description column replaced, see the sketch below.
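
For the sample df above, one more assignment then matches the expected outcome (df.pop returns a column and removes it from the frame):

df['description'] = df.pop('first_obs')
print(df)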
