Home > Enterprise >  Repetitive column-wide multiplication and division in Python
Repetitive column-wide multiplication and division in Python

Time:02-14

I am trying to do some column-wide multiplication and division in groups and join them together. But I'd like to do it at scale - currently it's a bit repetitive - columns with A string divide by columns with D string and multiply by WC column and then repeat a similar process on group columns B and C towards columns D and columns WC. At the end, I will merge them into the same dataframe. How can I make the process more efficient?

input:

df = pd.DataFrame({"cid" : {0 : "cd1", 1 : "cd2", 2 : "cd3"},
                   "A1970" : {0 : 3.2, 1 : 3.5, 2 : .4},
                   "A1980" : {0 : 3.1, 1 : 3.6, 2 : .5},
                   "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
                   "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
                   "C1970" : {0 : 3.2, 1 : 3.3, 2 : .3},
                   "C1980" : {0 : 3.3, 1 : 3.4, 2 : .3},
                   "D1970" : {0 : 2.4, 1 : 1.3, 2 : .7},
                   "D1980" : {0 : 3.2, 1 : 1.3, 2 : .2},
                   "WC" : {0 :0.5, 1 : 0.3, 2 : .1}
                  }).set_index(['cid'])

#      A1970  A1980  B1970  B1980  C1970  C1980  D1970  D1980   WC
# cid                                                             
# cd1    3.2    3.1    2.5    3.2    3.2    3.3    2.4    3.2  0.5
# cd2    3.5    3.6    1.2    1.3    3.3    3.4    1.3    1.3  0.3
# cd3    0.4    0.5    0.7    0.1    0.3    0.3    0.7    0.2  0.1

processing:

df_a = (df.filter(regex='A')
        .div(df.filter(regex='D').values)
        .multiply(df["WC"], axis="index")
        .add_suffix("_rt"))

#      A1970_rt  A1980_rt
# cid                    
# cd1  0.666667  0.484375
# cd2  0.807692  0.830769
# cd3  0.057143  0.250000

df_b = (df.filter(regex='B')
        .div(df.filter(regex='D').values)
        .multiply(df["WC"], axis="index")
        .add_suffix("_rt"))

df_c = (df.filter(regex='C')
        .div(df.filter(regex='D').values)
        .multiply(df["WC"], axis="index")
        .add_suffix("_rt"))

CodePudding user response:

Solution with MultiIndex for possible match columns by years after A,B,C and D:

df1 = df.set_index('WC', append=True)
df1.columns = pd.MultiIndex.from_frame(df1.columns.str.extract('(\D )(\d )', expand=True))

wc = df1.index.get_level_values('WC')
df1 = df1.loc[:, ['A','B','C']].div(df1.xs('D', axis=1)).mul(wc, axis='index')
df1.columns = df1.columns.map('_'.join)
df1 = df1.add_suffix("_rt").reset_index()
print (df1)
   cid   WC  A_1970_rt  A_1980_rt  B_1970_rt  B_1980_rt  C_1970_rt  C_1980_rt
0  cd1  0.5   0.666667   0.484375   0.520833       0.50   0.666667   0.515625
1  cd2  0.3   0.807692   0.830769   0.276923       0.30   0.761538   0.784615
2  cd3  0.1   0.057143   0.250000   0.100000       0.05   0.042857   0.150000

CodePudding user response:

Using the underlying numpy data:

creating a new dataframe:

regex = r'^[ABC]'

N = 3 # number of A/B/C columns
# or if needed to calculate it programmatically
# N = df.filter(regex=regex).shape[1] // df.filter(regex='^D').shape[1]

(df.filter(regex=regex)   # get A/B/C columns
   .mul(df['WC'], axis=0) # multiply by WC
   # divide by D (tiled to match the number of A/B/C)
   .div(np.tile(df.filter(regex='D').values, (1,N)))
   .add_suffix('_rt')     # rename columns
)

output:

     A1970_rt  A1980_rt  B1970_rt  B1980_rt  C1970_rt  C1980_rt
cid                                                            
cd1  0.666667  0.484375  0.520833      0.50  0.666667  0.515625
cd2  0.807692  0.830769  0.276923      0.30  0.761538  0.784615
cd3  0.057143  0.250000  0.100000      0.05  0.042857  0.150000

updating the original dataframe in place

df.update( # update the dataframe with the output from:
 df.filter(regex=regex)   # get A/B/C columns
   .mul(df['WC'], axis=0) # multiply by WC
   # divide by D (tiled to match the number of A/B/C)
   .div(np.tile(df.filter(regex='D').values, (1,N))) 
)

output:

        A1970     A1980     B1970  B1980     C1970     C1980  D1970  D1980   WC
cid                                                                            
cd1  0.666667  0.484375  0.520833   0.50  0.666667  0.515625    2.4    3.2  0.5
cd2  0.807692  0.830769  0.276923   0.30  0.761538  0.784615    1.3    1.3  0.3
cd3  0.057143  0.250000  0.100000   0.05  0.042857  0.150000    0.7    0.2  0.1

CodePudding user response:

You could convert df.filter(regex='D') to a numpy array then use np.tile to repeat it, then divide the "A"-"C" columns and multiply:

col_msk = ~df.columns.str.contains('D|WC')
out = df.loc[:, col_msk].div(np.tile(df.filter(regex='D').to_numpy(), (1, sum(col_msk)//2))).multiply(df["WC"], axis="index").add_suffix("_rt")

Output:

     A1970_rt  A1980_rt  B1970_rt  B1980_rt  C1970_rt  C1980_rt
cid                                                            
cd1  0.666667  0.484375  0.520833      0.50  0.666667  0.515625
cd2  0.807692  0.830769  0.276923      0.30  0.761538  0.784615
cd3  0.057143  0.250000  0.100000      0.05  0.042857  0.150000
  • Related