I have two data frames:

df:ServicesA                   |  df:ServicesB
index  clientID  ServiceID_A   |  index  ClientID  ServiceID_B
0      A         1             |  0      A         4
1      B         2             |  1      B         5
2      C         3             |  2      D         6
3      B         4             |
I want to get a new df that lists the clients in df:ServicesB along with how many ServicesA rows each of those clients has, which would be something like this:
index  ClientID With ServiceID_B  count(ServiceID_A)
0      A                          1
1      B                          2
2      D                          0
I tried this:
First I got a unique list of client IDs from df:ServicesB and created a new df called 'df':

df = pd.DataFrame(df_ServicesB['ClientID'].unique(), columns=['ClientID'])

Then I use this function, which takes each client ID from df and iterates through df:ServicesA, counting the rows whose client ID matches (which is equivalent to counting that client's ServiceID_A entries in that dataframe):
def getTotaldfServicesA(clientid):
    total_services = 0
    for index, row in df_ServicesA.iterrows():
        if row['ClientID'] == clientid:
            total_services += 1
    return total_services

df['Total_servicesA'] = df.apply(lambda row: getTotaldfServicesA(row['ClientID']), axis=1)
I 'believe' the formula works, but I say 'I believe' because I technically haven't seen the resulting df: df_ServicesA is a very long list to iterate through, and the number of clients in df_ServicesB is also long, so the nested iteration grows quadratically. The function has been run a couple of times, and it has either failed with no error or timed out overnight, so I've been disconnected by the time I get back to my computer to see the result. Currently it's been running for 4 hours and I still don't have a resulting DataFrame.
Maybe there's a more pythonic way to do this? Or a way to make it more efficient so it doesn't take forever?
Thank you in advance
CodePudding user response:
Here you go. I tested your dataframes with CSV files, and it's working:
import pandas as pd

sA = pd.read_csv("ServicesA.csv")
sB = pd.read_csv("ServicesB.csv")

# Number of ServicesA rows per client
count_dict = sA.groupby(['ClientID']).size().to_dict()
# Client 'D' has no ServicesA rows, so give it an explicit 0
count_dict.setdefault('D', 0)

# Replace each client ID with its count
sB['count(ServiceID_A)'] = sB.ClientID.replace(count_dict)
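The hard-coded setdefault('D', 0) only covers this particular sample. A variant that handles any client missing from ServicesA, sketched here with the question's data built inline (column names assumed from the question):

```python
import pandas as pd

# Inline copies of the question's two frames
sA = pd.DataFrame({"ClientID": ["A", "B", "C", "B"],
                   "ServiceID_A": [1, 2, 3, 4]})
sB = pd.DataFrame({"ClientID": ["A", "B", "D"],
                   "ServiceID_B": [4, 5, 6]})

# Number of ServicesA rows per client
count_dict = sA.groupby("ClientID").size().to_dict()

# map() leaves missing clients as NaN, so any absent client gets 0,
# without naming 'D' explicitly
sB["count(ServiceID_A)"] = sB["ClientID"].map(count_dict).fillna(0).astype(int)
print(sB)
```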
Let me know if this doesn't work.
CodePudding user response:
You should always avoid iterating over a DataFrame/Series; it's very slow and should be your last resort. There is almost always a better way.
In this case, you want to look into groupby and merge.
You should read these guides:
- https://pandas.pydata.org/docs/user_guide/groupby.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
This will be much faster than using iterrows:
>>> res = (
...     df_ServicesB.merge(df_ServicesA, how='left', on='ClientID')
...                 .groupby('ClientID', as_index=False)
...                 [['ServiceID_A']].count()
... )
>>> res.columns = ["ClientID With ServiceID_B", "count(ServiceID_A)"]
>>> res
  ClientID With ServiceID_B  count(ServiceID_A)
0                         A                   1
1                         B                   2
2                         D                   0
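For reference, the same pipeline runs end-to-end like this, with the question's frames built inline (the column names are assumed from the question):

```python
import pandas as pd

df_ServicesA = pd.DataFrame({"ClientID": ["A", "B", "C", "B"],
                             "ServiceID_A": [1, 2, 3, 4]})
df_ServicesB = pd.DataFrame({"ClientID": ["A", "B", "D"],
                             "ServiceID_B": [4, 5, 6]})

# The left merge keeps every ServicesB client; clients with no ServicesA
# rows (like 'D') get NaN in ServiceID_A, and count() ignores NaN,
# so they end up with 0.
res = (
    df_ServicesB.merge(df_ServicesA, how='left', on='ClientID')
                .groupby('ClientID', as_index=False)
                [['ServiceID_A']].count()
)
res.columns = ["ClientID With ServiceID_B", "count(ServiceID_A)"]
print(res)
```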
CodePudding user response:
You can group your first dataframe and count the groups:
# Assuming that df:ServicesA is df_a
grouped = df_a.groupby("clientID").count()
>>> grouped
          ServiceID_A
clientID
A                   1
B                   2
C                   1
After that, you can left-outer-join the second dataframe onto the grouped result with the merge function:
# Assuming that df:ServicesB is df_b
joined = df_b[["ClientID"]].merge(grouped, left_on="ClientID",
                                  right_on="clientID", how="left").fillna(0)
>>> joined
  ClientID  ServiceID_A
0        A          1.0
1        B          2.0
2        D          0.0
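One small follow-up: fillna(0) promotes the count column to float (1.0, 2.0, 0.0). If integer counts are preferred, a cast at the end fixes that; a sketch, re-building the same frames inline:

```python
import pandas as pd

# Assuming that df:ServicesA is df_a and df:ServicesB is df_b
df_a = pd.DataFrame({"clientID": ["A", "B", "C", "B"],
                     "ServiceID_A": [1, 2, 3, 4]})
df_b = pd.DataFrame({"ClientID": ["A", "B", "D"],
                     "ServiceID_B": [4, 5, 6]})

grouped = df_a.groupby("clientID").count()

joined = df_b[["ClientID"]].merge(grouped, left_on="ClientID",
                                  right_on="clientID", how="left").fillna(0)

# fillna(0) left the count as float; cast back to int
joined["ServiceID_A"] = joined["ServiceID_A"].astype(int)
print(joined)
```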