I have two dataframe. The first dataframe have only one column: email, the first dataframe is a complete list of email. The second dataframe is a dataframe with three column: email, subscribe_or_unsubscribe, date. The second dataframe is a history of user subcribing or unsubscribing from the email system. The second dataframe is sorted by date with oldest date at index 0 and newest date at latest index.
If email in first dataframe does not exist in the second dataframe, then the email is not subscribed to email system. If the latest history of the email in the second dataframe is subscribe or unsubcribe then the email is subscribed or not subscribed respectively to the email system.
The second dataframe is a cropped version from the actual dataframe and contain error such that it is possible for user in the history to unsubscribe without previously subscribing.
I want to create a third dataframe with two column: email, is_subscribed. The third dataframe will contain the information whether the email is currently subscribed to the email system or not. Right now i'm using iterrows to iterate each rows in third dataframe and second dataframe, however iterrows is really slow, how do i do it with vectorization ?
For example, consider the following example of first dataframe and second dataframe
import numpy as np
import pandas as pd
first_dataframe = pd.DataFrame(["[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]" , "[email protected]", "[email protected]"], columns=['email'])
first_dataframe
# Year-Month-Day
second_dataframe = pd.DataFrame([['[email protected]', 'subscribe', '2020-12-26'], ['[email protected]', 'subscribe', '2021-06-06'], \
['[email protected]', 'unsubscribe', '2021-02-18'], ['[email protected]', 'unsubscribe', '2020-08-17'], \
['[email protected]', 'subscribe', '2022-01-08'], ['[email protected]', 'unsubscribe', '2022-03-10'], \
['[email protected]', 'subscribe', '2022-05-26']], columns=['email', "subscribe_or_unsubscribe", "date"])
second_dataframe['date'] = pd.to_datetime(second_dataframe['date'])
second_dataframe.sort_values(by='date', inplace=True)
second_dataframe
Then i create third dataframe which is a copy of the first dataframe with new column "is_subscribed" added with default value "unsubscribed"
third_dataframe = first_dataframe.copy(deep=True)
third_dataframe['is_subscribed'] = "unsubscribed"
third_dataframe
Then i use iterrows to know whether the email is currently subscribed to the email system or not
# The data is sorted from the oldest to newest
for i, third_dataframe_row in third_dataframe.iterrows():
for j, second_dataframe_row in second_dataframe.iterrows():
if third_dataframe_row['email'] == second_dataframe_row['email']:
third_dataframe.at[i,'is_subscribed'] = second_dataframe_row['subscribe_or_unsubscribe'] "d"
third_dataframe
How to do the above iterrows with vectorization?
CodePudding user response:
Use Series.map
by Series
created by DataFrame.drop_duplicates
for latest dates
(because already sorted) and convert email
to index, add d
and last replace non matched values by default value in Series.fillna
:
s = (second_dataframe.drop_duplicates('email', keep='last')
.set_index('email')['subscribe_or_unsubscribe'])
third_dataframe['is_subscribed'] = (third_dataframe['email'].map(s) "d")
.fillna('unsubscribed')
print (third_dataframe)
email is_subscribed
0 [email protected] unsubscribed
1 [email protected] unsubscribed
2 [email protected] subscribed
3 [email protected] unsubscribed
4 [email protected] subscribed
5 [email protected] unsubscribed
6 [email protected] unsubscribed
7 [email protected] unsubscribed
8 [email protected] subscribed