Is there an efficient way to group by and backward fill a large dataframe? I have a dataframe with 2768186062 rows which looks like this:
email              email_created_date         country
[email protected]  2021-03-27 00:44:49+00:00  NaN
[email protected]  2022-01-14 16:27:18+00:00  IND
[email protected]  2021-05-14 13:24:15+00:00  NaN
[email protected]  2021-09-16 16:09:45+00:00  IND
[email protected]  2021-03-27 00:38:37+00:00  NaN
[email protected]  2022-01-14 12:25:28+00:00  IND
[email protected]  2022-09-22 22:08:50+00:00  NaN
[email protected]  2022-09-22 22:11:51+00:00  CA
I want to backward fill the country column within each group of rows that share the same email, so the expected output looks like:
email              email_created_date         country
[email protected]  2021-03-27 00:44:49+00:00  IND
[email protected]  2022-01-14 16:27:18+00:00  IND
[email protected]  2021-05-14 13:24:15+00:00  IND
[email protected]  2021-09-16 16:09:45+00:00  IND
[email protected]  2021-03-27 00:38:37+00:00  IND
[email protected]  2022-01-14 12:25:28+00:00  IND
[email protected]  2022-09-22 22:08:50+00:00  CA
[email protected]  2022-09-22 22:11:51+00:00  CA
I have done it using the following code:
dt1[['country']] = (dt1
    .sort_values(['email', 'email_created_date'])
    .groupby(['email'], as_index=False)[['country']]
    .fillna(method='bfill'))
but it is taking a long time (more than an hour) to execute because the dataframe is so big. Could anyone suggest how to perform the above operation in the most efficient way? I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory). Please help!
CodePudding user response:
Here's an example of how you could improve the efficiency:
def bfill_country(df):
    # Backward fill the country column within one group.
    return df['country'].bfill()

dt1['country'] = (dt1
    .sort_values(['email', 'email_created_date'])
    .groupby('email', group_keys=False)
    .apply(bfill_country))
This code first defines a function bfill_country that takes one group's rows and returns its country column backward filled. groupby with group_keys=False then applies this function to each group separately; the result is a Series indexed by the original row labels, so it can be assigned straight back to the country column of the original DataFrame (without group_keys=False the result gets a MultiIndex and the assignment would not align).
This fills the missing values one group at a time. Note, however, that apply calls a Python function once per group, so with millions of distinct emails the per-group overhead can dominate.
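If that overhead is a concern, the same per-group backward fill can be written fully vectorized (a sketch, assuming the same dt1 columns as above):
# Vectorized per-group backward fill: no Python function call per group.
dt1['country'] = (dt1
    .sort_values(['email', 'email_created_date'])
    .groupby('email')['country']
    .bfill())
Because the grouped bfill returns a Series indexed by the original row labels, the assignment aligns correctly even though the frame was sorted first.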
Let me know if it helped!
CodePudding user response:
It may be faster to avoid sorting and grouping the data at all. If you can safely assume that each email corresponds to only one country, then you can construct a mapping from email to country and use that to assign all the country values.
from io import StringIO
import pandas as pd
s = """\
email              email_created_date         country
[email protected]  2021-03-27 00:44:49+00:00  NaN
[email protected]  2022-01-14 16:27:18+00:00  IND
[email protected]  2021-05-14 13:24:15+00:00  NaN
[email protected]  2021-09-16 16:09:45+00:00  IND
[email protected]  2021-03-27 00:38:37+00:00  NaN
[email protected]  2022-01-14 12:25:28+00:00  IND
[email protected]  2022-09-22 22:08:50+00:00  NaN
[email protected]  2022-09-22 22:11:51+00:00  CA
"""
# Load the example data
df = pd.read_csv(StringIO(s), sep=r'\s\s+', engine='python',
                 parse_dates=['email_created_date'], dtype={'country': 'category'})
# Construct a lookup table from email -> country
valid_rows = df.loc[~df['country'].isnull(), ['email', 'country']]
country_mapping = valid_rows.drop_duplicates('email').set_index('email')['country']
# Apply the lookup table.
df['country'] = df['email'].map(country_mapping)
print(df)
               email        email_created_date country
0  [email protected] 2021-03-27 00:44:49+00:00     IND
1  [email protected] 2022-01-14 16:27:18+00:00     IND
2  [email protected] 2021-05-14 13:24:15+00:00     IND
3  [email protected] 2021-09-16 16:09:45+00:00     IND
4  [email protected] 2021-03-27 00:38:37+00:00     IND
5  [email protected] 2022-01-14 12:25:28+00:00     IND
6  [email protected] 2022-09-22 22:08:50+00:00      CA
7  [email protected] 2022-09-22 22:11:51+00:00      CA
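If some emails in the full data could map to more than one country, the map above silently keeps whichever value drop_duplicates saw first and overwrites everything else. A more conservative variant (a sketch reusing the country_mapping built above) fills only the rows that are actually missing:
# Keep existing country values; fill only the NaNs.
# The mapped values come from the same column, so they are already
# valid categories for the categorical dtype.
df['country'] = df['country'].fillna(df['email'].map(country_mapping))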