I am trying to do some column-wide multiplication and division in groups and join them together. But I'd like to do it at scale - currently it's a bit repetitive - columns with A string divide by columns with D string and multiply by WC column and then repeat a similar process on group columns B and C towards columns D and columns WC. At the end, I will merge them into the same dataframe. How can I make the process more efficient?
input:
df = pd.DataFrame({"cid" : {0 : "cd1", 1 : "cd2", 2 : "cd3"},
"A1970" : {0 : 3.2, 1 : 3.5, 2 : .4},
"A1980" : {0 : 3.1, 1 : 3.6, 2 : .5},
"B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
"B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
"C1970" : {0 : 3.2, 1 : 3.3, 2 : .3},
"C1980" : {0 : 3.3, 1 : 3.4, 2 : .3},
"D1970" : {0 : 2.4, 1 : 1.3, 2 : .7},
"D1980" : {0 : 3.2, 1 : 1.3, 2 : .2},
"WC" : {0 :0.5, 1 : 0.3, 2 : .1}
}).set_index(['cid'])
# A1970 A1980 B1970 B1980 C1970 C1980 D1970 D1980 WC
# cid
# cd1 3.2 3.1 2.5 3.2 3.2 3.3 2.4 3.2 0.5
# cd2 3.5 3.6 1.2 1.3 3.3 3.4 1.3 1.3 0.3
# cd3 0.4 0.5 0.7 0.1 0.3 0.3 0.7 0.2 0.1
processing:
df_a = (df.filter(regex='A')
.div(df.filter(regex='D').values)
.multiply(df["WC"], axis="index")
.add_suffix("_rt"))
# A1970_rt A1980_rt
# cid
# cd1 0.666667 0.484375
# cd2 0.807692 0.830769
# cd3 0.057143 0.250000
df_b = (df.filter(regex='B')
.div(df.filter(regex='D').values)
.multiply(df["WC"], axis="index")
.add_suffix("_rt"))
df_c = (df.filter(regex='C')
.div(df.filter(regex='D').values)
.multiply(df["WC"], axis="index")
.add_suffix("_rt"))
CodePudding user response:
Solution with MultiIndex
for possible match columns by years after A,B,C
and D
:
df1 = df.set_index('WC', append=True)
df1.columns = pd.MultiIndex.from_frame(df1.columns.str.extract('(\D )(\d )', expand=True))
wc = df1.index.get_level_values('WC')
df1 = df1.loc[:, ['A','B','C']].div(df1.xs('D', axis=1)).mul(wc, axis='index')
df1.columns = df1.columns.map('_'.join)
df1 = df1.add_suffix("_rt").reset_index()
print (df1)
cid WC A_1970_rt A_1980_rt B_1970_rt B_1980_rt C_1970_rt C_1980_rt
0 cd1 0.5 0.666667 0.484375 0.520833 0.50 0.666667 0.515625
1 cd2 0.3 0.807692 0.830769 0.276923 0.30 0.761538 0.784615
2 cd3 0.1 0.057143 0.250000 0.100000 0.05 0.042857 0.150000
CodePudding user response:
Using the underlying numpy data:
creating a new dataframe:
regex = r'^[ABC]'
N = 3 # number of A/B/C columns
# or if needed to calculate it programmatically
# N = df.filter(regex=regex).shape[1] // df.filter(regex='^D').shape[1]
(df.filter(regex=regex) # get A/B/C columns
.mul(df['WC'], axis=0) # multiply by WC
# divide by D (tiled to match the number of A/B/C)
.div(np.tile(df.filter(regex='D').values, (1,N)))
.add_suffix('_rt') # rename columns
)
output:
A1970_rt A1980_rt B1970_rt B1980_rt C1970_rt C1980_rt
cid
cd1 0.666667 0.484375 0.520833 0.50 0.666667 0.515625
cd2 0.807692 0.830769 0.276923 0.30 0.761538 0.784615
cd3 0.057143 0.250000 0.100000 0.05 0.042857 0.150000
updating the original dataframe in place
df.update( # update the dataframe with the output from:
df.filter(regex=regex) # get A/B/C columns
.mul(df['WC'], axis=0) # multiply by WC
# divide by D (tiled to match the number of A/B/C)
.div(np.tile(df.filter(regex='D').values, (1,N)))
)
output:
A1970 A1980 B1970 B1980 C1970 C1980 D1970 D1980 WC
cid
cd1 0.666667 0.484375 0.520833 0.50 0.666667 0.515625 2.4 3.2 0.5
cd2 0.807692 0.830769 0.276923 0.30 0.761538 0.784615 1.3 1.3 0.3
cd3 0.057143 0.250000 0.100000 0.05 0.042857 0.150000 0.7 0.2 0.1
CodePudding user response:
You could convert df.filter(regex='D')
to a numpy array then use np.tile
to repeat it, then divide the "A"-"C" columns and multiply:
col_msk = ~df.columns.str.contains('D|WC')
out = df.loc[:, col_msk].div(np.tile(df.filter(regex='D').to_numpy(), (1, sum(col_msk)//2))).multiply(df["WC"], axis="index").add_suffix("_rt")
Output:
A1970_rt A1980_rt B1970_rt B1980_rt C1970_rt C1980_rt
cid
cd1 0.666667 0.484375 0.520833 0.50 0.666667 0.515625
cd2 0.807692 0.830769 0.276923 0.30 0.761538 0.784615
cd3 0.057143 0.250000 0.100000 0.05 0.042857 0.150000