Hi everyone,

I have a dataframe with 2 million unique student codes and two other columns: initial and final year. I need to create a new dataframe with only two columns (student code and year), with one row for each year the student remained studying. For instance, if the student with code 1234567 studied from 2013 to 2015, the new dataframe must have three rows, as shown below:
| COD | YEAR |
|-------- | ------ |
| 1234567 | 2013 |
| 1234567 | 2014 |
| 1234567 | 2015 |
I have the following for loop working:

```python
import pandas as pd
import numpy as np

# creating a sample df
df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6,
    'YEAR_END': [2019, 2020, 2021] * 6,
})

newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for index, row in df.iterrows():
    for i in range(row['YEAR_INCLUSION'], row['YEAR_END'] + 1):
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row['COD'], 'YEAR': i}])])
```
The problem is time. Even after splitting the data into smaller dataframes, it takes too long: with a 411,000-row chunk, the code takes 16 to 20 hours.
I tried the same code with itertuples, but times were still far too slow, even though itertuples is known for being faster than iterrows:

```python
newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for row in df.itertuples():
    for i in range(row.YEAR_INCLUSION, row.YEAR_END + 1):
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row.COD, 'YEAR': i}])])
```
I couldn't figure out a way to use `map` or `apply`, which allegedly would give much better results.
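(A side note on the timing: a large part of the cost is the `pd.concat` inside the loop, which copies the entire growing frame on every iteration, so the total work is quadratic. Collecting plain dicts in a list and building the frame once at the end avoids that — a sketch using the sample columns above, still a loop but without the repeated copies:)

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6,
    'YEAR_END': [2019, 2020, 2021] * 6,
})

# accumulate plain dicts, then construct the DataFrame a single time
records = []
for row in df.itertuples(index=False):
    for year in range(row.YEAR_INCLUSION, row.YEAR_END + 1):
        records.append({'COD': row.COD, 'YEAR': year})
newdf = pd.DataFrame(records)
```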
Thanks in advance for the help!
CodePudding user response:
To improve performance, use a vectorized solution: repeat each row with `Index.repeat` and `DataFrame.loc` to create the new rows, then build the `YEAR` column by adding a `GroupBy.cumcount` counter to the start year:
```python
dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)

df = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
        .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
df['YEAR'] += df.groupby(level=0).cumcount()
df = df.reset_index(drop=True)

print(df.head(10))
      COD  YEAR
0  476468  2017
1  476468  2018
2  476468  2019
3  476009  2018
4  476009  2019
5  476009  2020
6  508035  2020
7  508035  2021
8  157027  2017
9  157027  2018
```
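To see why the `cumcount` step works: after `Index.repeat`, every repeated row still carries its original index label, so `groupby(level=0).cumcount()` yields 0, 1, 2, … within each student — exactly the offset to add to the start year. A minimal sketch with the single student from the question:

```python
import pandas as pd

df = pd.DataFrame({'COD': [1234567], 'YEAR_INCLUSION': [2013], 'YEAR_END': [2015]})

dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)   # rows needed per student
out = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
         .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
out['YEAR'] += out.groupby(level=0).cumcount()          # add the 0,1,2,... offset
out = out.reset_index(drop=True)
```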
Performance on sample data:

```python
# 18k rows
df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18000),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6000,
    'YEAR_END': [2019, 2020, 2021] * 6000,
})
```
```python
def cel_new(df):
    df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
    df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
    return df

def jez(df):
    dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)
    df = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
            .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
    df['YEAR'] += df.groupby(level=0).cumcount()
    return df

def cel(df):
    df['YEAR'] = df.apply(lambda row: list(range(row['YEAR_INCLUSION'], row['YEAR_END'] + 1)), axis=1)
    df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
    return df
```
```python
In [160]: %timeit jez(df)
10.2 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [161]: %timeit cel_new(df)
51.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [162]: %timeit cel(df)
273 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
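For very large frames, the same repeat-and-offset idea can be sketched in pure NumPy, skipping the groupby entirely by computing the per-row offsets with `np.arange` arithmetic (a hypothetical variant, not benchmarked above):

```python
import numpy as np
import pandas as pd

def numpy_version(df):
    # number of years covered by each row
    n = (df['YEAR_END'] - df['YEAR_INCLUSION'] + 1).to_numpy()
    start = np.repeat(df['YEAR_INCLUSION'].to_numpy(), n)
    # offsets 0,1,2,... restarting at each row boundary
    offsets = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)
    return pd.DataFrame({
        'COD': np.repeat(df['COD'].to_numpy(), n),
        'YEAR': start + offsets,
    })
```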
CodePudding user response:
I would use `range()` to build, for each COD, the list of all years that will be included, and then `explode()` it:

```python
df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
```
Outputting the two columns as you need:
```
       COD  YEAR
0    68305  2017
0    68305  2018
0    68305  2019
1   132110  2018
1   132110  2019
1   132110  2020
2   884918  2020
2   884918  2021
3   418804  2017
3   418804  2018
3   418804  2019
4   371878  2018
4   371878  2019
4   371878  2020
5   759197  2020
5   759197  2021
6   315279  2017
6   315279  2018
6   315279  2019
7   479690  2018
7   479690  2019
7   479690  2020
8   352097  2020
8   352097  2021
9   527199  2017
9   527199  2018
9   527199  2019
10  601018  2018
10  601018  2019
10  601018  2020
11   41268  2020
11   41268  2021
12  603053  2017
12  603053  2018
12  603053  2019
13  193902  2018
13  193902  2019
13  193902  2020
14  742955  2020
14  742955  2021
15  995129  2017
15  995129  2018
15  995129  2019
16   78649  2018
16   78649  2019
16   78649  2020
17  675890  2020
17  675890  2021
```
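One caveat worth knowing: `explode` leaves `YEAR` as object dtype and keeps the repeated index labels (visible above), so you may want to cast and reset afterwards. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'COD': [1234567], 'YEAR_INCLUSION': [2013], 'YEAR_END': [2015]})
df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')

df['YEAR'] = df['YEAR'].astype(int)   # explode yields object dtype
df = df.reset_index(drop=True)        # the index was repeated per original row
```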