Home > Back-end >  Duplicate rows according to a function of two columns
Duplicate rows according to a function of two columns

Time:12-19

My initial example dataframe is in the following format:

>>> import pandas as pd
>>> d = {'n': ['one', 'two', 'three', 'four'],
         'initial': [3, 4, 10, 10],
         'final': [3, 7, 11, 7],}
>>> df = pd.DataFrame(d)
>>> df
       n  initial  final
0    one        3      3
1    two        4      7
2  three       10     11
3   four       10      7

What I hope to achieve is to duplicate the values in the n column a number of times corresponding to the values between those in the initial and final columns.

For example, in the first row, initial and final hold the same value, so there should be one instance of 'one' in the output dataframe's n column. For the second row, initial and final are three numbers apart, so there should be four repetitions of 'two', and so on. If final is less than initial, there should be no instances of the value in n in the output.

There should also be a count column which counts up from the value in the initial column to the value in the final column. My expected output is as follows:

       n  count
0    one      3
1    two      4
2    two      5
3    two      6
4    two      7
5  three     10
6  three     11

I've tried using reindex with a new index based on df.final - df.initial 1, but this does not handle the negative values as in the fourth row of the example dataframe.

CodePudding user response:

In your case create the list with range then explode

# in order to save running time , df = df.query('initial<=final')
df['count'] = [ range(x, y   1) for x, y in zip(df['initial'], df['final'])]
out = df.explode('count').dropna()
Out[13]: 
       n  initial  final count
0    one        3      3     3
1    two        4      7     4
1    two        4      7     5
1    two        4      7     6
1    two        4      7     7
2  three       10     11    10
2  three       10     11    11

CodePudding user response:

out = (df[['n']].join(df.apply(lambda x: range(x[1], x[2]   1), axis=1)
                      .explode().to_frame('count'))
       .dropna(subset=['count']))

out

      n  count
0    one      3
1    two      4
2    two      5
3    two      6
4    two      7
5  three     10
6  three     11
  • Related