How to create a function that takes the sum of three columns but after each iteration a column chang-CodePudding

I have the following DataFrame called df:

Numeric	Country1	Country2	Country3	Country4	Country5
5	Yes	No	No	Yes	Yes
10	No	Yes	Yes	No	Yes
6	Yes	No	No	Yes	No
8	Yes	No	Yes	Yes	Yes
9	No	Yes	No	Yes	No

I would like to find a way to turn the code below into a loop or function as I have many rows and many columns in my DataFrame. The first line of code below takes the sum of column Numeric, and the sum of column Country1 == "YES" values only and it inputs the output to the data frame below under row Country1 and column Country1.

The second line of the code it will take the sum of column Numeric, the sum of Colum Country1 that has YEs values, and the sum of column Country2 == "Yes". The output of this line gets entered into row name Country2 and column name Country1.

This pattern continues for row Country3, Country4 and Country 5. I found a way to hard coded but I would like to find a more efficient way of doing it.

df1.loc["Country1", "Country1"] = df.loc[(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country2", "Country1"] = df.loc[(df['Country2'] == "Yes") & (df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country3", "Country1"] = df.loc[(df['Country3'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country4", "Country1"] = df.loc[(df['Country4'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country5", "Country1"] = df.loc[(df['Country5'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()

The desired output will look like the DataFrame below. I just created the dataframe to show where each values goes but the numbers entered below are not accurate. Is just to illustrate an example.

	Country1	Country2	Country3	Country4	Country5
Country1	5	3	19	8	8
Country2	10	0	20	7	7
Country3	15	8	1	4	7
Country4	20	9	3	8	5
Country5	2	10	5	9	4

Thank you!

CodePudding user response：

This can be done with masking the columns with Yes, then use np.einsum to get True or False for each combinations of two columns for each row. Multiply by the values in Numeric column and sum over the axis that represents the rows (in my choice of the subscripts in einsum it is the last one). Create a dataframe with index and columns as wanted

m = df.drop(['Numeric'], axis=1).eq('Yes')
print(m)
#    Country1  Country2  Country3  Country4  Country5
# 0      True     False     False      True      True
# 1     False      True      True     False      True
# 2      True     False     False      True     False
# 3      True     False      True      True      True
# 4     False      True     False      True     False

res = pd.DataFrame(
    data=(np.einsum('ij,ik->jki',m,m)
            *df['Numeric'].to_numpy()[None,:]
         ).sum(axis=-1),
    index=m.columns, 
    columns = m.columns
)
print(res)
#           Country1  Country2  Country3  Country4  Country5
# Country1        19         0         8        19        13
# Country2         0        19        10         9        10
# Country3         8        10        18         8        18
# Country4        19         9         8        28        13
# Country5        13        10        18        13        23

Note: you could even do data = np.einsum('ij,ik,i->jk',m,m,df['Numeric'].to_numpy()) which includes the multiplication with the numeric column and the sum over the last axis directly.

EDIT: here is a way using combinations_with_replacement that might be easier to understand what is happening (but the original answer with einsum does the same style of operations).

from itertools import combinations_with_replacement

#split mask and values
m = df.drop(['Numeric'], axis=1).eq('Yes')
arr = df['Numeric'].to_numpy()

# create result dataframe
res = pd.DataFrame(data=0, index=m.columns, columns=m.columns)
# fill each cell
for colA, colB in combinations_with_replacement(m.columns, 2):
    res.loc[colA, colB] = res.loc[colB, colA] = ((m[colA]&m[colB])*arr).sum()

print(res)
# same result

CodePudding user response：

Here's one way:

from itertools import product

country_cols = [c for c in df.columns if c.startswith("Country")]
df1 = pd.DataFrame(None, index=country_cols, columns=country_cols)
combs = product(country_cols, country_cols)
for comb in combs:
    matches = (df[list(comb)] == 'Yes').all(axis=1)
    total = df.loc[matches, 'Numeric'].sum()
    df1.loc[comb[0], comb[1]] = total
print(df1)

Out:

         Country1 Country2 Country3 Country4 Country5
Country1       19        0        8       19       13
Country2        0       19       10        9       10
Country3        8       10       18        8       18
Country4       19        9        8       28       13
Country5       13       10       18       13       23

Is this what you wanted?

Note, the resulting matrix is symmetrical. If you are happy to only calculate the unique results (no duplication) use:

from itertools import combinations
combs = combinations(country_cols, 2)
...

Numeric	Country1	Country2	Country3	Country4	Country5
5	Yes	No	No	Yes	Yes
10	No	Yes	Yes	No	Yes
6	Yes	No	No	Yes	No
8	Yes	No	Yes	Yes	Yes
9	No	Yes	No	Yes	No

Numeric	Country1	Country2	Country3	Country4	Country5
5	Yes	No	No	Yes	Yes
10	No	Yes	Yes	No	Yes
6	Yes	No	No	Yes	No
8	Yes	No	Yes	Yes	Yes
9	No	Yes	No	Yes	No

Numeric	Country1	Country2	Country3	Country4	Country5
5	Yes	No	No	Yes	Yes
10	No	Yes	Yes	No	Yes
6	Yes	No	No	Yes	No
8	Yes	No	Yes	Yes	Yes
9	No	Yes	No	Yes	No