Pandas apply row-wise a function and create multiple new columns

Time:04-14

What is the best way to apply a row-wise function and create multiple new columns?

I have two dataframes and working code, but it is most likely not optimal.

df1 (dataframe has thousands of rows and xx number of columns)

sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607

df2 (column header maps to the sic-code in df1. There are in total 12 sic-codes and the dataframe is thousands of rows long)

1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308

I achieved the result I need with the following code:

import numpy as np

ind_list = np.arange(1, 13)  # industry (sic) codes 1 to 12


def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const',i]


def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar',i]


def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar',i]


def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar',i]


def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar',i]
            
mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')        
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')

The output is something like this:

sic data1 data2 data3 data4 c_bar a1_bar a2_bar a3_bar a4_bar
5 0.10316948 0.61408639 0.04042675 0.79255749 0.56357931 0.42920472 0.20701581 0.67639811 0.37778029
6 0.5730904 0.16753145 0.27835136 0.00178992 0.51793793 0.06772307 0.15084885 0.12451806 0.33114948
3 0.87710893 0.66834187 0.14286608 0.12609769 0.75873957 0.72586804 0.6081763 0.14598001 0.21557266
8 0.24565579 0.56195558 0.93316676 0.20988936 0.67404545 0.65221594 0.79758557 0.67093021 0.33400764
12 0.79703344 0.61066111 0.94602909 0.56218703 0.92384307 0.30836159 0.72521994 0.00795362 0.76348227
6 0.86604791 0.28454782 0.97229172 0.21853932 0.75650652 0.40788056 0.53233553 0.60326386 0.27399405

Cell values in the example are randomly generated, but the point is to map based on sic-codes and add rows from df2 as new columns into df1.

CodePudding user response:

To do this, you need to:

  1. Transpose df2 so that its columns are correct for concatenation
  2. Index it with the df1["sic"] column to get the correct rows
  3. Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index e.g. 5, 6, 3, 8, 12, 6 with a new one e.g. 0, 1, 2, 3, 4, 5 while keeping the actual values the same. This is so that pandas doesn't get confused while concatenating them)
  4. Concatenate the two dataframes
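As a rough sketch, the four steps can be written out one at a time. The tiny frames below are made up for illustration (they are not the data from the question), and df2 is given integer column labels so that no type conversion is needed:

```python
import pandas as pd

# Made-up miniature versions of df1 and df2 (illustrative values only).
df1 = pd.DataFrame({"sic": [2, 1, 2], "data1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame(
    {1: [10.0, 11.0], 2: [20.0, 21.0]},
    index=["c_bar", "a1_bar"],
)

# Step 1: transpose so that each sic code becomes a row label.
lookup = df2.T

# Step 2: pick one lookup row per row of df1 (repeated codes are fine).
picked = lookup.loc[df1["sic"]]

# Step 3: drop the sic labels so the rows line up with df1's 0..n-1 index.
picked = picked.reset_index(drop=True)

# Step 4: put the two frames side by side.
merged = pd.concat([df1, picked], axis=1)
print(merged)
```

This relies on df1 having the default 0..n-1 index. If df1 has some other index, it would need the same .reset_index(drop=True) treatment before concatenating.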

Note: Because of the way I read in the dataframes (see the full code below), the column labels of df2 were strings while the values of the sic column of df1 were ints. I therefore used .astype(str) to get step 2 working. If your column labels are already ints, you may need to remove the .astype(str).
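To see which situation applies, you can compare the dtypes of both sides before doing the lookup. The sketch below uses made-up frames and shows the cast used in this answer next to one possible alternative, converting the labels of df2 to ints once (df2_int is a name introduced only for this example):

```python
import pandas as pd

# Made-up frames: df1 holds int codes, df2 has string column labels.
df1 = pd.DataFrame({"sic": [2, 1]})
df2 = pd.DataFrame({"1": [10.0], "2": [20.0]}, index=["c_bar"])

# Inspect the types on both sides of the lookup.
print(df1["sic"].dtype)
print(df2.columns.dtype)

# Option A (as in this answer): cast the keys to str at lookup time.
by_str = df2.T.loc[df1["sic"].astype(str)]

# Option B: convert the labels of df2 to ints once, then look up directly.
df2_int = df2.copy()
df2_int.columns = df2_int.columns.map(int)
by_int = df2_int.T.loc[df1["sic"]]
```

Both options select the same rows; which one is more convenient depends on how the frames are used elsewhere.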

Here is the single line of code to do these things:

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

Here is the full code I used:

from io import StringIO
import pandas as pd

df1 = pd.read_csv(StringIO("""
sic data1   data2   data3   data4   data5
5   0.90783598  0.84722083  0.47149924  0.98724123  0.50654476
6   0.53442684  0.59730371  0.92486887  0.61531646  0.62784041
3   0.56806423  0.09619383  0.33846097  0.71878313  0.96316724
8   0.86933042  0.64965755  0.94549745  0.08866519  0.92156389
12  0.651328    0.37193774  0.9679044   0.36898991  0.15161838
6   0.24555531  0.50195983  0.79114578  0.9290596   0.10672607
"""), sep=r"\s+")  # fields are separated by runs of whitespace
df2 = pd.read_csv(StringIO("""
    1   2   3   4   5   6   7   8   9   10  11  12
c_bar   0.4955329   0.92970292  0.68049726  0.91325006  0.55578465  0.78056519  0.53954711  0.90335326  0.93986402  0.0204794   0.51575764  0.61144255
a1_bar  0.75781444  0.81052669  0.99910449  0.62181902  0.11797144  0.40031316  0.08561665  0.35296894  0.14445697  0.93799762  0.80641802  0.31379671
a2_bar  0.41432552  0.36313911  0.13091618  0.39251953  0.66249636  0.31221897  0.15988528  0.1620938   0.55143589  0.66571044  0.68198944  0.23806947
a3_bar  0.38918855  0.83689178  0.15838139  0.39943204  0.48615188  0.06299899  0.86343819  0.47975619  0.05300611  0.15080875  0.73088725  0.3500239
a4_bar  0.47201384  0.90874121  0.50417142  0.70047698  0.24820601  0.34302454  0.4650635   0.0992668   0.55142391  0.82947194  0.28251699  0.53170308
"""), sep=r"\s+", index_col=[0])  # first field of each data row is the row label

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

print(merged)

which produces the output:

   sic     data1     data2     data3  ...    a1_bar    a2_bar    a3_bar    a4_bar
0    5  0.907836  0.847221  0.471499  ...  0.117971  0.662496  0.486152  0.248206
1    6  0.534427  0.597304  0.924869  ...  0.400313  0.312219  0.062999  0.343025
2    3  0.568064  0.096194  0.338461  ...  0.999104  0.130916  0.158381  0.504171
3    8  0.869330  0.649658  0.945497  ...  0.352969  0.162094  0.479756  0.099267
4   12  0.651328  0.371938  0.967904  ...  0.313797  0.238069  0.350024  0.531703
5    6  0.245555  0.501960  0.791146  ...  0.400313  0.312219  0.062999  0.343025

[6 rows x 11 columns]
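A possible alternative, not part of the approach above, is DataFrame.join with its on parameter, which looks up each value of a column of the caller in the index of the other frame. The sketch below uses made-up frames with integer labels and gives df1 a non-default index to show that the index is left untouched:

```python
import pandas as pd

# Made-up frames; df1 deliberately has a non-default index.
df1 = pd.DataFrame(
    {"sic": [2, 1, 2], "data1": [0.1, 0.2, 0.3]},
    index=[10, 11, 12],
)
df2 = pd.DataFrame(
    {1: [10.0, 11.0], 2: [20.0, 21.0]},
    index=["c_bar", "a1_bar"],
)

# Match df1["sic"] against the index of the transposed df2 (a left join).
merged = df1.join(df2.T, on="sic")
print(merged)
```

Because this is a left join, a sic code that is missing from df2 would produce NaN in the new columns rather than raising an error, and the same dtype caveat (ints versus strings) applies to the join keys.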

CodePudding user response:

Try transposing df2 and applying transformations to it. Transposing a data frame means converting the rows into columns of your data frame.

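df2_tr = df2.T.apply(mapFunc, axis=0)  # mapFunc stands in for your own column-wise function

Then you can concatenate the transformed columns of df2 with the columns of df1, using df1 = pd.concat([df1, df2_tr], axis=1). Note that pd.concat aligns rows on the index, so df2_tr first has to be reindexed to match the rows of df1 (for example by selecting rows with df1["sic"] and resetting the index, as in the answer above).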
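A minimal sketch of this idea follows. The frames are made up, and mapFunc is a hypothetical column-wise transformation (here it centres each column on its mean); substitute whatever transformation you actually need. DataFrame.apply with axis=0 is used to run it on each column:

```python
import pandas as pd

# Made-up frames with integer sic labels.
df1 = pd.DataFrame({"sic": [2, 1, 2], "data1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame(
    {1: [10.0, 11.0], 2: [20.0, 21.0]},
    index=["c_bar", "a1_bar"],
)

def mapFunc(col):
    # Hypothetical transformation: subtract the column mean from each value.
    return col - col.mean()

# Transpose, then apply the function to each column (axis=0).
df2_tr = df2.T.apply(mapFunc, axis=0)

# Select one transformed row per row of df1 and give it df1's index,
# so that pd.concat lines the rows up correctly.
aligned = df2_tr.loc[df1["sic"]].set_axis(df1.index)
result = pd.concat([df1, aligned], axis=1)
print(result)
```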
