I have the following DataFrame called df:
Numeric | Country1 | Country2 | Country3 | Country4 | Country5 |
---|---|---|---|---|---|
5 | Yes | No | No | Yes | Yes |
10 | No | Yes | Yes | No | Yes |
6 | Yes | No | No | Yes | No |
8 | Yes | No | Yes | Yes | Yes |
9 | No | Yes | No | Yes | No |
I would like to find a way to turn the code below into a loop or function as I have many rows and many columns in my DataFrame. The first line of code below takes the sum of column Numeric, and the sum of column Country1 == "YES" values only and it inputs the output to the data frame below under row Country1 and column Country1.
The second line of the code it will take the sum of column Numeric, the sum of Colum Country1 that has YEs values, and the sum of column Country2 == "Yes". The output of this line gets entered into row name Country2 and column name Country1.
This pattern continues for row Country3, Country4 and Country 5. I found a way to hard coded but I would like to find a more efficient way of doing it.
df1.loc["Country1", "Country1"] = df.loc[(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country2", "Country1"] = df.loc[(df['Country2'] == "Yes") & (df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country3", "Country1"] = df.loc[(df['Country3'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country4", "Country1"] = df.loc[(df['Country4'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()
df1.loc["Country5", "Country1"] = df.loc[(df['Country5'] == "Yes") &(df['Country1'] == "Yes"), 'Numeric'].sum()
The desired output will look like the DataFrame below. I just created the dataframe to show where each values goes but the numbers entered below are not accurate. Is just to illustrate an example.
Country1 | Country2 | Country3 | Country4 | Country5 | |
---|---|---|---|---|---|
Country1 | 5 | 3 | 19 | 8 | 8 |
Country2 | 10 | 0 | 20 | 7 | 7 |
Country3 | 15 | 8 | 1 | 4 | 7 |
Country4 | 20 | 9 | 3 | 8 | 5 |
Country5 | 2 | 10 | 5 | 9 | 4 |
Thank you!
CodePudding user response:
This can be done with masking the columns with Yes, then use np.einsum
to get True or False for each combinations of two columns for each row. Multiply by the values in Numeric column and sum
over the axis that represents the rows (in my choice of the subscripts in einsum
it is the last one). Create a dataframe with index and columns as wanted
m = df.drop(['Numeric'], axis=1).eq('Yes')
print(m)
# Country1 Country2 Country3 Country4 Country5
# 0 True False False True True
# 1 False True True False True
# 2 True False False True False
# 3 True False True True True
# 4 False True False True False
res = pd.DataFrame(
data=(np.einsum('ij,ik->jki',m,m)
*df['Numeric'].to_numpy()[None,:]
).sum(axis=-1),
index=m.columns,
columns = m.columns
)
print(res)
# Country1 Country2 Country3 Country4 Country5
# Country1 19 0 8 19 13
# Country2 0 19 10 9 10
# Country3 8 10 18 8 18
# Country4 19 9 8 28 13
# Country5 13 10 18 13 23
Note: you could even do data = np.einsum('ij,ik,i->jk',m,m,df['Numeric'].to_numpy())
which includes the multiplication with the numeric column and the sum over the last axis directly.
EDIT: here is a way using combinations_with_replacement
that might be easier to understand what is happening (but the original answer with einsum
does the same style of operations).
from itertools import combinations_with_replacement
#split mask and values
m = df.drop(['Numeric'], axis=1).eq('Yes')
arr = df['Numeric'].to_numpy()
# create result dataframe
res = pd.DataFrame(data=0, index=m.columns, columns=m.columns)
# fill each cell
for colA, colB in combinations_with_replacement(m.columns, 2):
res.loc[colA, colB] = res.loc[colB, colA] = ((m[colA]&m[colB])*arr).sum()
print(res)
# same result
CodePudding user response:
Here's one way:
from itertools import product
country_cols = [c for c in df.columns if c.startswith("Country")]
df1 = pd.DataFrame(None, index=country_cols, columns=country_cols)
combs = product(country_cols, country_cols)
for comb in combs:
matches = (df[list(comb)] == 'Yes').all(axis=1)
total = df.loc[matches, 'Numeric'].sum()
df1.loc[comb[0], comb[1]] = total
print(df1)
Out:
Country1 Country2 Country3 Country4 Country5
Country1 19 0 8 19 13
Country2 0 19 10 9 10
Country3 8 10 18 8 18
Country4 19 9 8 28 13
Country5 13 10 18 13 23
Is this what you wanted?
Note, the resulting matrix is symmetrical. If you are happy to only calculate the unique results (no duplication) use:
from itertools import combinations
combs = combinations(country_cols, 2)
...