Home > Enterprise >  How to add a column in pandas dataframe based on other columns for large dataset?
How to add a column in pandas dataframe based on other columns for large dataset?

Time:04-15

I have a CSV file that contains 1,000,000 rows and columns like following.

col1 col2 col3...col20
0     1    0  ... 10 
0     1    0  ... 20 
1     0    0  ... 30
0     1    0  ... 40
0     0    1  ... 50
................

I want to add a new column called col1_col2_col3 like following.

col1 col2 col3 col1_col2_col3 ...col20
0     1    0        2         ... 10 
0     1    0        2         ... 20 
1     0    0        1         ... 30
0     1    0        2         ... 40
0     0    1        3         ... 50
.................

I have loaded the data file in a pandas data frame. Then tried following.

for idx, row in df.iterrows():

    if (df.loc[idx, 'col1'] == 1):
        df.loc[idx, 'col1_col2_col3'] = 1

    elif (df.loc[idx, 'col2'] == 1)
        df.loc[idx, 'col1_col2_col3'] = 2

    elif (df.loc[idx, 'col3'] == 1)
        df.loc[idx, 'col1_col2_col3'] = 3

The above solution wroks. However, my code is taking very long time to run. Is there any way to create col1_col2_col3 fast?

CodePudding user response:

Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:

df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)

N.B. It assumes that each row can have only one nonzero value in columns ['col1_col2_col3'].

Output:

   col1  col2  col3  ...  col20  col1_col2_col3
0     0     1     0  ...     10               2
1     0     1     0  ...     20               2
2     1     0     0  ...     30               1
3     0     1     0  ...     40               2
4     0     0     1  ...     50               3

CodePudding user response:

You can use Numpy's argmax

df.assign(
    col1_col2_col3=
    df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1)   1
)

   col1  col2  col3  col20  col1_col2_col3
0     0     1     0     10               2
1     0     1     0     20               2
2     1     0     0     30               1
3     0     1     0     40               2
4     0     0     1     50               3
  • Related