Repeat several rows with the same order in pandas dataframe

I have a dataframe like this.

Input:

index colA   colB   colC 
0      A      1      QQQ  
1      A      1      WWW  
2      A      1      EEE  
3      A      1      RRR   
4      B      2      TTT   
5      B      2      YYY  
6      B      2      UUU  
7      C      1      III

I would like to repeat the rows that share the same colA value, as a block, as many times as the value in colB. For example, rows 4-6 have colA = B and colB = 2, so I want rows 4-6 repeated two times (the order within 4-6 does not change), as shown in the output. If colB is 3, repeat three times; if it is 1, don't repeat. In addition, I would like to add a column num that numbers each repeated block: colA = A has no repeat, so num is 0; colA = B has two repeats, so num is 1 and 2; and so on. Is there an elegant way to do this in a pandas DataFrame?

Output:

index colA   colB   colC    num
0      A      1      QQQ     0
1      A      1      WWW     0
2      A      1      EEE     0
3      A      1      RRR     0
4      B      2      TTT     1
5      B      2      YYY     1
6      B      2      UUU     1
7      B      2      TTT     2
8      B      2      YYY     2
9      B      2      UUU     2
10     C      1      III     3

CodePudding user response：

Use a custom function to repeat each group in order (this is slow when there are many groups). Passing the keys parameter to concat labels each copy, which makes it possible to add the new column num with GroupBy.ngroup:

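import pandas as pd

def f(x):
    # Number of repeats for this group, read from its first colB value
    pos = int(x.iat[0, x.columns.get_loc('colB')])
    # Stack pos copies of the group; keys labels each copy 0..pos-1
    return pd.concat([x] * pos, keys=range(pos))

# The copy label is the outer index level; reset_index turns it into column level_0
df = df.groupby('colA', group_keys=False, sort=False).apply(f).reset_index(level=0)

# Each (copy, colA) pair is one repeated block; number the blocks in order of appearance
df['num'] = df.groupby(['level_0', 'colA'], sort=False).ngroup()
df = df.drop('level_0', axis=1)
print(df)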
      colA  colB colC  num
index                     
0        A     1  QQQ    0
1        A     1  WWW    0
2        A     1  EEE    0
3        A     1  RRR    0
4        B     2  TTT    1
5        B     2  YYY    1
6        B     2  UUU    1
4        B     2  TTT    2
5        B     2  YYY    2
6        B     2  UUU    2
7        C     1  III    3

CodePudding user response：

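IIUC, you could use Index.repeat to duplicate the index and GroupBy.cumcount to compute your "num" column. Note that this repeats each row in place rather than repeating the whole block, so the row order and the num values differ from the output requested in the question: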

(df
 .loc[df.index.repeat(df['colB'])]
 .assign(num=lambda d: d.groupby(['colA', 'colC']).cumcount().cumsum()) 
)

output:

   index colA  colB colC  num
0      0    A     1  QQQ    0
1      1    A     1  WWW    0
2      2    A     1  EEE    0
3      3    A     1  RRR    0
4      4    B     2  TTT    0
4      4    B     2  TTT    1
5      5    B     2  YYY    1
5      5    B     2  YYY    2
6      6    B     2  UUU    2
6      6    B     2  UUU    3
7      7    C     1  III    3
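
The output above repeats each row in place, so the blocks are interleaved. One possible way to keep the Index.repeat idea and still get the block order from the question is to sort the repeated rows by group and copy number before numbering them. This is only a sketch: the helper columns pos, copy, and grp are names made up for illustration, the input frame is rebuilt here with a default RangeIndex, and it assumes colB is a non-negative integer that is constant within each colA group.

```python
import pandas as pd

# Input frame from the question (assumption: default RangeIndex, no 'index' column).
df = pd.DataFrame({
    "colA": list("AAAABBBC"),
    "colB": [1, 1, 1, 1, 2, 2, 2, 1],
    "colC": ["QQQ", "WWW", "EEE", "RRR", "TTT", "YYY", "UUU", "III"],
})

# Repeat each row colB times and keep the original row position as column 'pos'.
out = df.loc[df.index.repeat(df["colB"])].rename_axis("pos").reset_index()

# copy: 0 for the first copy of a row, 1 for the second, and so on.
out["copy"] = out.groupby("pos").cumcount()
# grp: rank of each colA value in order of first appearance.
out["grp"] = pd.factorize(out["colA"])[0]

# Reorder so that each whole block appears once per copy, rows in original order.
out = out.sort_values(["grp", "copy", "pos"])
# Number the (group, copy) blocks in the order they now appear.
out["num"] = out.groupby(["grp", "copy"], sort=False).ngroup()

# Drop the helper columns and renumber the rows.
out = out.drop(columns=["pos", "copy", "grp"]).reset_index(drop=True)
print(out)
```

With the sample data, num comes out as 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3 and the B rows appear as TTT, YYY, UUU twice in a row, matching the requested output. Sorting on pos as the last key makes the within-block order explicit instead of relying on the sort being stable.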