Hi everyone,

I have a dataframe with 2 million unique student codes and two other columns: initial and final year. I need to create a new dataframe with only two columns (student code and year), with one row for each year the student remained studying. For instance, if the student with code 1234567 studied from 2013 to 2015, the new dataframe must have three rows, as shown below:
| COD | YEAR |
|-------- | ------ |
| 1234567 | 2013 |
| 1234567 | 2014 |
| 1234567 | 2015 |
I have the following for loop working:

```python
import pandas as pd
import numpy as np

# creating a sample df
df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6,
    'YEAR_END': [2019, 2020, 2021] * 6,
})

newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for index, row in df.iterrows():
    for i in range(row['YEAR_INCLUSION'], row['YEAR_END'] + 1):
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row['COD'], 'YEAR': i}])])
```
The problem is time. Even after splitting the data into smaller dataframes, it takes too long: with a 411,000-row chunk, the code takes 16 to 20 hours.
I tried the same code with itertuples, but times were still far too slow, even though itertuples is known for being faster than iterrows:

```python
newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for row in df.itertuples():
    for i in range(row.YEAR_INCLUSION, row.YEAR_END + 1):
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row.COD, 'YEAR': i}])])
```
I couldn't figure out a way to use `map` or `apply`, which allegedly would give much better results.
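(A side note on the timing: a large part of the cost is the `pd.concat` inside the loop, which copies the entire growing frame on every iteration, so the total work is quadratic. Collecting plain dicts in a list and building the frame once at the end avoids that — a sketch using the sample columns above, still a loop but without the repeated copies:)

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6,
    'YEAR_END': [2019, 2020, 2021] * 6,
})

# accumulate plain dicts, then construct the DataFrame a single time
records = []
for row in df.itertuples(index=False):
    for year in range(row.YEAR_INCLUSION, row.YEAR_END + 1):
        records.append({'COD': row.COD, 'YEAR': year})
newdf = pd.DataFrame(records)
```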
Thanks in advance for the help!
CodePudding user response:
To improve performance, use a vectorized solution: repeat each row with `Index.repeat` and `DataFrame.loc` to create the new rows, then build the `YEAR` column by adding a `GroupBy.cumcount` counter to the start year:
```python
dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)

df = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
        .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
df['YEAR'] += df.groupby(level=0).cumcount()
df = df.reset_index(drop=True)

print(df.head(10))
      COD  YEAR
0  476468  2017
1  476468  2018
2  476468  2019
3  476009  2018
4  476009  2019
5  476009  2020
6  508035  2020
7  508035  2021
8  157027  2017
9  157027  2018
```
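To see why the `cumcount` step works: after `Index.repeat`, every repeated row still carries its original index label, so `groupby(level=0).cumcount()` yields 0, 1, 2, … within each student — exactly the offset to add to the start year. A minimal sketch with the single student from the question:

```python
import pandas as pd

df = pd.DataFrame({'COD': [1234567], 'YEAR_INCLUSION': [2013], 'YEAR_END': [2015]})

dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)   # rows needed per student
out = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
         .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
out['YEAR'] += out.groupby(level=0).cumcount()          # add the 0,1,2,... offset
out = out.reset_index(drop=True)
```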
Performance on sample data:

```python
# 18k rows
df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18000),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6000,
    'YEAR_END': [2019, 2020, 2021] * 6000,
})
```
```python
def cel_new(df):
    df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
    df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
    return df

def jez(df):
    dif = df['YEAR_END'].sub(df['YEAR_INCLUSION']).add(1)
    df = (df.loc[df.index.repeat(dif), ['COD', 'YEAR_INCLUSION']]
            .rename(columns={'YEAR_INCLUSION': 'YEAR'}))
    df['YEAR'] += df.groupby(level=0).cumcount()
    return df

def cel(df):
    df['YEAR'] = df.apply(lambda row: list(range(row['YEAR_INCLUSION'], row['YEAR_END'] + 1)), axis=1)
    df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
    return df
```
```python
In [160]: %timeit jez(df)
10.2 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [161]: %timeit cel_new(df)
51.6 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [162]: %timeit cel(df)
273 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
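For very large frames, the same repeat-and-offset idea can be sketched in pure NumPy, skipping the groupby entirely by computing the per-row offsets with `np.arange` arithmetic (a hypothetical variant, not benchmarked above):

```python
import numpy as np
import pandas as pd

def numpy_version(df):
    # number of years covered by each row
    n = (df['YEAR_END'] - df['YEAR_INCLUSION'] + 1).to_numpy()
    start = np.repeat(df['YEAR_INCLUSION'].to_numpy(), n)
    # offsets 0,1,2,... restarting at each row boundary
    offsets = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)
    return pd.DataFrame({
        'COD': np.repeat(df['COD'].to_numpy(), n),
        'YEAR': start + offsets,
    })
```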
CodePudding user response:
I would use `range()` to build, for each COD, the list of all years that will be included, and then `explode()` it:

```python
df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')
```
Outputting the two columns as you need:
```
       COD  YEAR
0    68305  2017
0    68305  2018
0    68305  2019
1   132110  2018
1   132110  2019
1   132110  2020
2   884918  2020
2   884918  2021
3   418804  2017
3   418804  2018
3   418804  2019
4   371878  2018
4   371878  2019
4   371878  2020
5   759197  2020
5   759197  2021
6   315279  2017
6   315279  2018
6   315279  2019
7   479690  2018
7   479690  2019
7   479690  2020
8   352097  2020
8   352097  2021
9   527199  2017
9   527199  2018
9   527199  2019
10  601018  2018
10  601018  2019
10  601018  2020
11   41268  2020
11   41268  2021
12  603053  2017
12  603053  2018
12  603053  2019
13  193902  2018
13  193902  2019
13  193902  2020
14  742955  2020
14  742955  2021
15  995129  2017
15  995129  2018
15  995129  2019
16   78649  2018
16   78649  2019
16   78649  2020
17  675890  2020
17  675890  2021
```
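One caveat worth knowing: `explode` leaves `YEAR` as object dtype and keeps the repeated index labels (visible above), so you may want to cast and reset afterwards. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'COD': [1234567], 'YEAR_INCLUSION': [2013], 'YEAR_END': [2015]})
df['YEAR'] = [range(x, y + 1) for x, y in zip(df['YEAR_INCLUSION'], df['YEAR_END'])]
df = df.drop(columns=['YEAR_INCLUSION', 'YEAR_END']).explode('YEAR')

df['YEAR'] = df['YEAR'].astype(int)   # explode yields object dtype
df = df.reset_index(drop=True)        # the index was repeated per original row
```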