Determine who had more than 80% of a given consignor-CodePudding

I have a problem. I want to investigate whether my assumption is correct that it can be said that receiver X frequently receives a delivery from consignor Y. I have grouped receiver and consignor for this purpose.

However, what I would like to do now, I would like to output to myself all receivers where the most frequent consignor contains more than 80% of all consignors in receiver .

For example we have receiver c this has received from consignor 2, 3 what. From consignor 2 only 1 packet and from consignor 3 4 packets. Thus he has received a total of 5 packets. Thus consignor 3 accepts 80% of all consignor within receiver c. Thus receiver c should be represented in a table. This should be done below for all others.

In simple terms, I want all receivers that have more than 80% of a given consignor.

How can I do this?

What I want

receiver  consignor  count_with_most_consignor  count_all
       c          3                          4          5
       d          2                          4          5

Dataframe

    consignor receiver
0           1        a
1           2        a
2           1        a
3           2        c
4           3        c
5           1        b
6           3        c
7           3        c
8           3        c
9           2        d
10          2        d
11          2        d
12          1        d
13          2        d

Code

import pandas as pd
d = {'consignor': [1, 2, 1, 2, 3, 1, 3, 3, 3, 2, 2, 2, 1, 2],  
     'receiver': ['a', 'a', 'a', 'c', 'c', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']}
df = pd.DataFrame(data=d)
print(df)
df_  = df.groupby(['receiver','consignor']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
print(df_)

X = 'c'
print(df_[df_['receiver'] == X].shape[1])
print(df_[df_['receiver'] == X])

Out

[OUT]
3
  receiver  consignor  counts
4        c          3       4
3        c          2       1

CodePudding user response：

`crosstab`

# frequency table
s = pd.crosstab(df['receiver'], df['consignor'])

# agg columns along axis=1
d = {
    'consignor': s.idxmax(1),
    'count_with_most_consignor': s.max(1),
    'count_all': s.sum(1)
}

(
    s.assign(**d) # assign the agg columns
    .drop(s.columns, axis=1) # drop the unwanted columns
    .query('count_with_most_consignor / count_all >= .8') # filter the rows
)

Result

consignor  consignor  count_with_most_consignor  count_all
receiver                                                  
b                  1                          1          1
c                  3                          4          5
d                  2                          4          5

CodePudding user response：

You can simply get all shares with value_counts(normalize=True)

In [13]: df.groupby("receiver").value_counts(normalize=True)
Out[13]: 
receiver  consignor
a         1            0.666667
          2            0.333333
b         1            1.000000
c         3            0.800000
          2            0.200000
d         2            0.800000
          1            0.200000
dtype: float64

And then filter for values above 0.8 e.g. like this:

In [15]: res = df.groupby("receiver").value_counts(normalize=True).where(lambda x: x > 0.8).dropna()
Out[15]: 
receiver  consignor
b         1            1.0
c         3            0.8
d         2            0.8
dtype: float64

The simplest way to get the rest of the info is calculating them seperately

In [40]: count_with_most_consignor = df.value_counts().rename("count_with_most_consignor")
Out[41]: 
consignor  receiver
2          d           4
3          c           4
1          a           2
           b           1
           d           1
2          a           1
           c           1
Name: count_with_most_consignor, dtype: int64

In [42]: count_all = df.groupby("receiver").size().rename("count_all")

In [43]: count_all
Out[43]: 
receiver
a    3
b    1
c    5
d    5
Name: count_all, dtype: int64

and join that up together.

In [39]: res.rename("share").to_frame().join(count_with_most_consignor).join(count_all).reset_index()
Out[39]: 
  receiver  consignor  share  count_with_most_consignor  count_all
0        b          1    1.0                          1          1
1        c          3    0.8                          4          5
2        d          2    0.8                          4          5

CodePudding user response：

I could come up only with this kinda tricky way:)

# First, we create a group column 
df["receiver consignor"] = df["receiver"]   "-"   df["consignor"].astype("str")

# Then we count "receiver consignor" values
df = df.merge(df["receiver consignor"].value_counts(), left_on="receiver consignor", right_index=True, suffixes=(None, "_count"))

# Now we count deliveries by receivers
df = df.merge(df["receiver"].value_counts(), left_on="receiver", right_index=True, suffixes=(None, "_count"))

# Now we can calculate the percentage
df["percentage"] = df["receiver consignor_count"] * 100 / df["receiver_count"]

# At last, we can filter out what we need now
df_result = df[df["percentage"] >= 80].groupby(["receiver", "consignor", "receiver consignor_count", "receiver_count"]).first().reset_index()[["receiver", "consignor", "receiver consignor_count", "receiver_count"]]

CodePudding user response：

A simple solution with dictionaries as follows:

import pandas as pd
d = {'consignor': [1, 2, 1, 2, 3, 1, 3, 3, 3, 2, 2, 2, 1, 2],
     'receiver': ['a', 'a', 'a', 'c', 'c', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']}
df = pd.DataFrame(data=d)
df_  = df.groupby(['receiver','consignor']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
df__ = df_[['receiver','counts']]
df___ = df__.groupby(['receiver']).aggregate('sum')
dic = df___.to_dict()
for row in range(len(df_)):
    if df_['counts'][row] / dic['counts'][df_['receiver'][row]] < 0.8:
        df_ = df_.drop(row)
print(df_)

Outputs:

 receiver  consignor  counts
4        c          3       4
6        d          2       4
2        b          1       1