I have a dataframe with columns in the following format:
import pandas as pd
df = pd.DataFrame(data={
'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
'repeatVal': ['NaN', 2, 'NaN', 'NaN', 3, 'NaN', 'NaN', 'NaN', 'NaN'],
})
I want to create a new column that takes each value from 'value' and repeats it downward the number of times given in 'repeatVal', so the output looks like 'result':
df = pd.DataFrame(data={
'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
'repeatVal': ['NaN', 2, 'NaN', 'NaN', 3, 'NaN', 'NaN', 'NaN', 'NaN'],
'result': ['NaN', 456, 456, 'NaN', 121, 121, 121, 'NaN', 'NaN']
})
To be clear, I do not want to duplicate the rows. I only want to create a new column where values are repeated n times, where n is specified in a different column. The column 'repeatVal' is formatted so that the repeats never overlap: there will always be enough NaN values between the repeat indicators in 'repeatVal'.
I have read the docs on np.repeat and np.tile, but those don't appear to solve this issue.
CodePudding user response:
One option is to use groupby.cumcount to build masks:
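# Turn the 'NaN' placeholder strings into real missing values
df = df.replace('NaN', float('nan'))
# m1: rows where a repeat count is given (the start of each run)
m1 = df['repeatVal'].notna()
# m2: rows whose position within the run is below the run's repeat count
m2 = df.groupby(m1.cumsum()).cumcount().lt(df['repeatVal'].ffill())
# Keep 'value' only on start rows, fill it downward, then blank rows outside the run
df['result'] = df['value'].where(m1).ffill().where(m2)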
Output:
   value  repeatVal  result
0    123        NaN     NaN
1    456        2.0   456.0
2    789        NaN   456.0
3    111        NaN     NaN
4    121        3.0   121.0
5  34523        NaN   121.0
6   4352        NaN   121.0
7  45343        NaN     NaN
8    623        NaN     NaN
Intermediates:
   value  repeatVal  result     m1  m1.cumsum()  cumcount  cumcount < repeatVal.ffill()  value/masked/ffill
0    123        NaN     NaN  False            0         0                         False                 NaN
1    456        2.0   456.0   True            1         0                          True               456.0
2    789        NaN   456.0  False            1         1                          True               456.0
3    111        NaN     NaN  False            1         2                         False               456.0
4    121        3.0   121.0   True            2         0                          True               121.0
5  34523        NaN   121.0  False            2         1                          True               121.0
6   4352        NaN   121.0  False            2         2                         False               121.0
7  45343        NaN     NaN  False            2         3                         False               121.0
8    623        NaN     NaN  False            2         4                         False               121.0
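For reuse, the same masks can be wrapped in a small function. This is only a sketch: the name repeat_down and its parameters are made up for illustration, and it uses pd.to_numeric(errors="coerce") instead of replace() to turn the 'NaN' strings into real missing values, which is a swap on my part and not something the question requires.

```python
import pandas as pd


def repeat_down(df, value_col="value", count_col="repeatVal"):
    """Hypothetical helper: repeat value_col downward count_col times."""
    # Assumption: anything non-numeric in count_col (such as the 'NaN'
    # placeholder strings) should be treated as missing.
    counts = pd.to_numeric(df[count_col], errors="coerce")
    # True on the rows that start a run.
    starts = counts.notna()
    # Position of each row within the run opened by the latest start row.
    pos = df.groupby(starts.cumsum()).cumcount()
    # Keep a row only while its position is below the run length.
    keep = pos.lt(counts.ffill())
    # Take the value on start rows, fill it downward, blank everything else.
    return df[value_col].where(starts).ffill().where(keep)


df = pd.DataFrame(data={
    "value": [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
    "repeatVal": ["NaN", 2, "NaN", "NaN", 3, "NaN", "NaN", "NaN", "NaN"],
})
df["result"] = repeat_down(df)
print(df)
```

The result column comes back as float because it has to hold NaN. If integer output matters, casting with .astype("Int64") afterwards should give a nullable integer column.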