What is the best way to apply a row-wise function and create multiple new columns?
I have two dataframes and working code, but it is most likely not optimal.
df1 (dataframe has thousands of rows and xx number of columns)
sic | data1 | data2 | data3 | data4 | data5 |
---|---|---|---|---|---|
5 | 0.90783598 | 0.84722083 | 0.47149924 | 0.98724123 | 0.50654476 |
6 | 0.53442684 | 0.59730371 | 0.92486887 | 0.61531646 | 0.62784041 |
3 | 0.56806423 | 0.09619383 | 0.33846097 | 0.71878313 | 0.96316724 |
8 | 0.86933042 | 0.64965755 | 0.94549745 | 0.08866519 | 0.92156389 |
12 | 0.651328 | 0.37193774 | 0.9679044 | 0.36898991 | 0.15161838 |
6 | 0.24555531 | 0.50195983 | 0.79114578 | 0.9290596 | 0.10672607 |
df2 (column header maps to the sic-code in df1. There are in total 12 sic-codes and the dataframe is thousands of rows long)
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
c_bar | 0.4955329 | 0.92970292 | 0.68049726 | 0.91325006 | 0.55578465 | 0.78056519 | 0.53954711 | 0.90335326 | 0.93986402 | 0.0204794 | 0.51575764 | 0.61144255 |
a1_bar | 0.75781444 | 0.81052669 | 0.99910449 | 0.62181902 | 0.11797144 | 0.40031316 | 0.08561665 | 0.35296894 | 0.14445697 | 0.93799762 | 0.80641802 | 0.31379671 |
a2_bar | 0.41432552 | 0.36313911 | 0.13091618 | 0.39251953 | 0.66249636 | 0.31221897 | 0.15988528 | 0.1620938 | 0.55143589 | 0.66571044 | 0.68198944 | 0.23806947 |
a3_bar | 0.38918855 | 0.83689178 | 0.15838139 | 0.39943204 | 0.48615188 | 0.06299899 | 0.86343819 | 0.47975619 | 0.05300611 | 0.15080875 | 0.73088725 | 0.3500239 |
a4_bar | 0.47201384 | 0.90874121 | 0.50417142 | 0.70047698 | 0.24820601 | 0.34302454 | 0.4650635 | 0.0992668 | 0.55142391 | 0.82947194 | 0.28251699 | 0.53170308 |
I achieved the result I need with the following code:
ind_list = np.arange(1,13) # Create list of industries

def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const',i]

def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar',i]

def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar',i]

def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar',i]

def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar',i]

mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')
The output is something like this:
sic | data1 | data2 | data3 | data4 | c_bar | a1_bar | a2_bar | a3_bar | a4_bar |
---|---|---|---|---|---|---|---|---|---|
5 | 0.10316948 | 0.61408639 | 0.04042675 | 0.79255749 | 0.56357931 | 0.42920472 | 0.20701581 | 0.67639811 | 0.37778029 |
6 | 0.5730904 | 0.16753145 | 0.27835136 | 0.00178992 | 0.51793793 | 0.06772307 | 0.15084885 | 0.12451806 | 0.33114948 |
3 | 0.87710893 | 0.66834187 | 0.14286608 | 0.12609769 | 0.75873957 | 0.72586804 | 0.6081763 | 0.14598001 | 0.21557266 |
8 | 0.24565579 | 0.56195558 | 0.93316676 | 0.20988936 | 0.67404545 | 0.65221594 | 0.79758557 | 0.67093021 | 0.33400764 |
12 | 0.79703344 | 0.61066111 | 0.94602909 | 0.56218703 | 0.92384307 | 0.30836159 | 0.72521994 | 0.00795362 | 0.76348227 |
6 | 0.86604791 | 0.28454782 | 0.97229172 | 0.21853932 | 0.75650652 | 0.40788056 | 0.53233553 | 0.60326386 | 0.27399405 |
Cell values in the example are randomly generated, but the point is to map based on sic-codes and add rows from df2 as new columns into df1.
CodePudding user response:
To do this, you need to:
1. Transpose df2 so that its columns are correct for concatenation.
2. Index it with the df1["sic"] column to get the correct rows.
3. Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index, e.g. 5, 6, 3, 8, 12, 6, with a new one, e.g. 0, 1, 2, 3, 4, 5, while keeping the actual values the same. pandas aligns rows by index label when concatenating, so the labels have to match those of df1.)
4. Concatenate the two dataframes.
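The steps above can be sketched one at a time with intermediate variables. This is only an illustration: the two small frames are made-up stand-ins for the question's data, the names lookup and picked are hypothetical, and it assumes df1 has the default 0, 1, 2, ... index and that the column labels of df2 are ints.

```python
import pandas as pd

# Small stand-ins for the question's frames (values are illustrative only).
df1 = pd.DataFrame({"sic": [5, 6, 3, 6], "data1": [0.1, 0.2, 0.3, 0.4]})
df2 = pd.DataFrame(
    {3: [0.68, 0.99], 5: [0.55, 0.11], 6: [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

# Step 1: transpose so that each sic code becomes a row label.
lookup = df2.T

# Step 2: pick one lookup row per row of df1 (repeated codes are fine).
picked = lookup.loc[df1["sic"]]

# Step 3: drop the sic-code index so the rows line up with df1 by position.
picked = picked.reset_index(drop=True)

# Step 4: attach the new columns to df1.
merged = pd.concat([df1, picked], axis=1)
print(merged)
```

If df1 does not have a default index, step 3 would need to assign df1's index to picked instead of resetting it.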
Note: The way I read in the dataframes (see the full code below) left the column labels of df2 as strings, while the values of the sic column of df1 were ints. I therefore used .astype(str) to get step 2 working. If this is not the case for your data, you may need to remove the .astype(str).
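As a rough sketch of the type mismatch described in the note, the snippet below builds a df2 whose column labels are strings on purpose and shows two ways to make the lookup work. The frames are made-up stand-ins, and df2_int, rows_a and rows_b are hypothetical names used only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"sic": [5, 6, 3]})
# Column labels are deliberately strings, as they would be after read_csv.
df2 = pd.DataFrame({"3": [0.68], "5": [0.55], "6": [0.78]}, index=["c_bar"])

# Inspect both sides before indexing: integer keys versus string labels.
print(df1["sic"].dtype)
print(df2.columns.tolist())

# Option 1: cast the keys to strings at lookup time.
rows_a = df2.T.loc[df1["sic"].astype(str)]

# Option 2: convert the labels to ints once, then look up with the ints.
df2_int = df2.copy()
df2_int.columns = df2_int.columns.astype(int)
rows_b = df2_int.T.loc[df1["sic"]]
```

Both options select the same rows; the second may be more convenient if df2 is used for several lookups.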
Here is the single line of code to do these things:
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
Here is the full code I used:
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep=r"\s+")  # the text above is whitespace-separated as shown here
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep=r"\s+", index_col=[0])  # the text above is whitespace-separated as shown here
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
print(merged)
which produces the output:
sic data1 data2 data3 ... a1_bar a2_bar a3_bar a4_bar
0 5 0.907836 0.847221 0.471499 ... 0.117971 0.662496 0.486152 0.248206
1 6 0.534427 0.597304 0.924869 ... 0.400313 0.312219 0.062999 0.343025
2 3 0.568064 0.096194 0.338461 ... 0.999104 0.130916 0.158381 0.504171
3 8 0.869330 0.649658 0.945497 ... 0.352969 0.162094 0.479756 0.099267
4 12 0.651328 0.371938 0.967904 ... 0.313797 0.238069 0.350024 0.531703
5 6 0.245555 0.501960 0.791146 ... 0.400313 0.312219 0.062999 0.343025
[6 rows x 11 columns]
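A possible alternative, which is not what the answer above uses, is DataFrame.join with the on parameter. It matches the sic column of df1 against the index of the transposed df2 and keeps df1's own index, so no reset_index is needed. The frames below are made-up stand-ins, and the sketch assumes the column labels of df2 are ints like the sic values.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"sic": [5, 6, 3, 6], "data1": [0.1, 0.2, 0.3, 0.4]},
    index=[10, 11, 12, 13],  # deliberately not 0..n-1
)
df2 = pd.DataFrame(
    {3: [0.68, 0.99], 5: [0.55, 0.11], 6: [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

# Left join: each row of df1 looks up its sic code in the index of df2.T.
merged = df1.join(df2.T, on="sic")
print(merged)
```

With string column labels in df2, the labels would have to be converted to ints first (or sic to strings) for the keys to match.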
CodePudding user response:
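Try transposing df2 and applying transformations to it. Transposing a dataframe turns its rows into columns and its columns into rows.
df2_tr = df2.T.apply(lambda col: mapFunc(col), axis=0)  # mapFunc stands for your own column-wise function
Then you can concatenate the transformed columns of df2 with the columns of df1 using df1 = pd.concat([df1, df2_tr], axis=1). Note that pd.concat aligns rows on index labels, so the rows of df2_tr must first be selected and re-indexed to match df1.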
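A minimal runnable sketch of this approach follows. The function map_func is a hypothetical placeholder (it only rounds the values), the frames are made-up stand-ins, and the column labels of df2 are assumed to be ints. It also shows what happens when the transposed frame is concatenated without matching the index first.

```python
import pandas as pd

def map_func(col: pd.Series) -> pd.Series:
    """Hypothetical column-wise transformation: round to two decimals."""
    return col.round(2)

df1 = pd.DataFrame({"sic": [5, 6, 3]})
df2 = pd.DataFrame({3: [0.680497], 5: [0.555785], 6: [0.780565]}, index=["c_bar"])

# Transpose, then transform each column.
df2_tr = df2.T.apply(map_func, axis=0)

# Naive concat: labels 0, 1, 2 (df1) never meet labels 3, 5, 6 (df2_tr),
# so the result has six rows, each padded with NaN.
naive = pd.concat([df1, df2_tr], axis=1)

# Select one row per sic code and reset the index before concatenating.
aligned = df2_tr.loc[df1["sic"]].reset_index(drop=True)
result = pd.concat([df1, aligned], axis=1)
print(result)
```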