Home > Enterprise >  Pandas equivelt of pyspark reduce and add?
Pandas equivelt of pyspark reduce and add?

Time:12-03

I have a dataframe in the following where Day_1, Day_2, Day_3 are the number of impressions in the past 3 days.

df = pd.DataFrame({'Day_1': [2, 4, 8, 0],
                   'Day_2': [2, 0, 0, 0],
                   'Day_3': [1, 1, 0, 0],
                  index=['user1', 'user2', 'user3', 'user4'])
df
      Day_1 Day_2 Day_3 
user1   2    2      1   
user2   4    0      1   
user3   8    0      0   
user4   0    0      0   

Now, I need to check if a user had any impression in the past n days. For example, if num_days = 2, I need to add a new column, impression, where it gets 1 if sum Day_1 and Day_2 is greater than zero, and 0 otherwise. Here is what I expect to see:

      Day_1 Day_2 Day_3 impression
user1   2    2      1   1
user2   4    0      1   1
user3   8    0      0   1
user4   0    0      0   0

It is a straightforward process in pyspark and I use something like this:

imp_cols = ['Day_' str(i) for i in range(1, num_days 1)]
df = df.withColumn("impression",reduce(add, [F.col(x) for x in imp_cols]))

CodePudding user response:

You can use the DataFrame.loc method to select the columns you want to sum, and then use the DataFrame.sum method to compute the sum of these columns. You can then use the DataFrame.clip method to set values less than 1 to 0 and values greater than or equal to 1 to 1. Finally, you can use the DataFrame.assign method to add the new impression column to the dataframe.

import pandas as pd

df = pd.DataFrame({'Day_1': [2, 4, 8, 0],
                   'Day_2': [2, 0, 0, 0],
                   'Day_3': [1, 1, 0, 0],
                  index=['user1', 'user2', 'user3', 'user4'])

num_days = 2
imp_cols = ['Day_' str(i) for i in range(1, num_days 1)]

df = df.loc[:, imp_cols].sum(axis=1).clip(0, 1).to_frame("impression")

df = df.assign(impression=impression)

CodePudding user response:

IIUC, you can use numpy.where with pandas.DataFrame.sum.

Try this :

df["impression"] = np.where(df.sum(axis=1).gt(0), 1, 0)

# Output :

print(df)
​
       Day_1  Day_2  Day_3  impression
user1      2      2      1           1
user2      4      0      1           1
user3      8      0      0           1
user4      0      0      0           0

If you want to select a specific columns/days, you can use pandas.DataFrame.filter :

num_days = 2
l = list(range(1, num_days 1))
pat= "|".join([str(x) for x in l])

sub_df = df.filter(regex="Day_[{}]".format(pat))

df["impression"] = np.where(sub_df.sum(axis=1).gt(0), 1, 0)
  • Related