split dataframe rows into multiple for fractional values

Time:10-29

sample data:

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})
out = pd.DataFrame({'split_me': [1, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.2], 'copy_me': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D']})

sample:  # input
   split_me copy_me
0       1.5       A
1       2.0       B
2       4.0       C
3       3.2       D

out:  # desired output
    split_me copy_me
0        1.0       A
1        0.5       A
2        1.0       B
3        1.0       B
4        1.0       C
5        1.0       C
6        1.0       C
7        1.0       C
8        1.0       D
9        1.0       D
10       1.0       D
11       0.2       D

I tried something like sample.loc[sample.index.repeat(sample['split_me'])]. However, this only repeats by the integer part, so a value like 1.9 returns 1 row when I need 2. It also leaves the values in split_me unchanged, whereas I need to duplicate the rows and allocate 1 to split_me while the remaining value is greater than 1, otherwise allocate whatever is left.

I can't think of a way to do this without getting loopy and complicated. My best approach is to ceil(split_me) and then run repeat, but I still need a way to allocate the value to the duplicated rows. Looking for a simpler solution if anyone has one.

CodePudding user response:

Use a custom repeat function

repeat_float = lambda x: ([1] * int(x // 1)) + ([x % 1] if x % 1 != 0 else [])
out = sample['split_me'].apply(repeat_float).explode().to_frame().join(sample['copy_me'])

Output:

>>> out.reset_index(drop=True)
   split_me copy_me
0         1       A
1       0.5       A
2         1       B
3         1       B
4         1       C
5         1       C
6         1       C
7         1       C
8         1       D
9         1       D
10        1       D
11      0.2       D
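
Since explode on a column of lists leaves split_me with object dtype, it may be worth casting back to float and checking that the pieces of each row still add up to the original value. A minimal sketch, assuming the sample frame from the question; the helper name split_units is hypothetical and is just a named version of the lambda above:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2],
                       'copy_me': ['A', 'B', 'C', 'D']})

def split_units(x):
    # Hypothetical helper: one 1.0 per whole unit, plus the leftover fraction if any.
    whole, frac = int(x // 1), x % 1
    return [1.0] * whole + ([frac] if frac != 0 else [])

# explode yields object dtype, so cast back to float explicitly.
pieces = sample['split_me'].apply(split_units).explode().astype(float)
out = pieces.to_frame().join(sample['copy_me'])

# The pieces of each source row should sum back to the original value.
totals = out.groupby(level=0)['split_me'].sum()
assert np.allclose(totals, sample['split_me'])

out = out.reset_index(drop=True)
```

The comparison uses np.allclose because 3.2 % 1 is only approximately 0.2 in floating point.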

CodePudding user response:

Try:

import pandas as pd
import numpy as np

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})

res = pd.concat([pd.Series(c, index=np.arange(1.0, s, 1.0).tolist() + np.array([s % 1]).tolist()) for s, c in zip(sample["split_me"], sample["copy_me"])])
print(res)

Output

1.0    A
0.5    A
1.0    B
0.0    B
1.0    C
2.0    C
3.0    C
0.0    C
1.0    D
2.0    D
3.0    D
0.2    D
dtype: object

Or cleaner:

def expanded_index(s, c):
    index = np.append(np.arange(1.0, s, 1.0), [s % 1])
    return pd.Series(c, index)


res = pd.concat([expanded_index(s, c) for s, c in zip(sample["split_me"], sample["copy_me"])])
print(res)
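
Note that in the output shown for this answer the numbers live in the index, whole units count up (2.0, 3.0) instead of repeating 1.0, and whole-number rows get a trailing 0.0 entry. A minimal sketch of a variant that keeps the same concat-of-Series structure but yields the two-column frame from the question, assuming non-negative values; the helper name expanded_units is hypothetical:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2],
                       'copy_me': ['A', 'B', 'C', 'D']})

def expanded_units(s, c):
    # Hypothetical helper: the index holds a 1.0 per whole unit
    # plus the remainder, which is skipped when it is zero.
    frac = s % 1
    index = np.append(np.ones(int(s)), [frac] if frac else [])
    return pd.Series(c, index=index)

res = pd.concat([expanded_units(s, c)
                 for s, c in zip(sample['split_me'], sample['copy_me'])])

# Move the index into a regular column to get the split_me / copy_me layout.
out = res.rename_axis('split_me').reset_index(name='copy_me')
print(out)
```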

CodePudding user response:

Yes we can do this

out = sample.reindex(sample.index.repeat(np.ceil(sample['split_me'])))
out['new'] = 1
con = ~out['copy_me'].duplicated(keep='last') & (out['split_me']%1!=0)
out['new'] = out['new'].mask(con, out['split_me']%1)
out
Out[195]: 
   split_me copy_me  new
0       1.5       A  1.0
0       1.5       A  0.5
1       2.0       B  1.0
1       2.0       B  1.0
2       4.0       C  1.0
2       4.0       C  1.0
2       4.0       C  1.0
2       4.0       C  1.0
3       3.2       D  1.0
3       3.2       D  1.0
3       3.2       D  1.0
3       3.2       D  0.2
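
The mask above finds the last copy of each row through copy_me, which assumes the copy_me values are unique in the input. A minimal sketch of a variant keyed on the repeated index instead, assuming the input index is unique and the values are non-negative: number the copies of each row with groupby(level=0).cumcount(), subtract that from split_me, and cap the result at 1.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2],
                       'copy_me': ['A', 'B', 'C', 'D']})

# Number of output rows per input row, e.g. 1.5 -> 2 and 3.2 -> 4.
counts = np.ceil(sample['split_me']).astype(int)
out = sample.loc[sample.index.repeat(counts)].copy()

# Position of each copy within its source row: 0, 1, 2, ...
pos = out.groupby(level=0).cumcount().to_numpy()

# What is left after the earlier copies each took 1, capped at 1.
out['split_me'] = (out['split_me'] - pos).clip(upper=1)
out = out.reset_index(drop=True)
```

This overwrites split_me in place rather than adding a separate new column, so the result has the same two columns as the desired output.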

CodePudding user response:

We could use np.modf to separate the fractional and integral parts of split_me, then repeat rows based only on the integral parts. Append the non-zero fractional parts, sort_index to get the expected order, then join back the columns and reset_index to restore the range index:

fractional, integral = np.modf(sample['split_me'])
df = (
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    pd.concat([integral.loc[integral.index.repeat(integral.astype(int))],
               fractional[fractional.ne(0)]])
        .sort_index(kind='stable')
        .to_frame().join(sample[['copy_me']]).reset_index(drop=True)
)

df:

    split_me copy_me
0        1.0       A
1        0.5       A
2        2.0       B
3        2.0       B
4        4.0       C
5        4.0       C
6        4.0       C
7        4.0       C
8        3.0       D
9        3.0       D
10       3.0       D
11       0.2       D

Setup and imports:

import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'split_me': [1.5, 2, 4, 3.2],
    'copy_me': ['A', 'B', 'C', 'D']
})
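
Note that the df shown in this answer repeats the integral part itself (2.0, 4.0, 3.0) rather than 1.0 as in the desired output from the question. A minimal sketch of one way to adjust for that, assuming non-negative values: build a Series of ones on the repeated index, then concatenate the non-zero fractional parts as before. The name ones is introduced here for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2],
                       'copy_me': ['A', 'B', 'C', 'D']})

fractional, integral = np.modf(sample['split_me'])

# One 1.0 per whole unit, carried on the repeated source index.
ones = pd.Series(1.0, index=sample.index.repeat(integral.astype(int)))

df = (
    pd.concat([ones, fractional[fractional.ne(0)]])
      .sort_index(kind='stable')   # stable sort keeps each fraction after its ones
      .to_frame('split_me')
      .join(sample[['copy_me']])
      .reset_index(drop=True)
)
```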