sample data:
sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})
out = pd.DataFrame({'split_me': [1, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.2], 'copy_me': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D']})
sample: # input
   split_me copy_me
0       1.5       A
1       2.0       B
2       4.0       C
3       3.2       D

out: # desired output
    split_me copy_me
0        1.0       A
1        0.5       A
2        1.0       B
3        1.0       B
4        1.0       C
5        1.0       C
6        1.0       C
7        1.0       C
8        1.0       D
9        1.0       D
10       1.0       D
11       0.2       D
I tried using something like sample.loc[sample.index.repeat(sample['split_me'])]. However, this only repeats by the integer part, so a value like 1.9 returns 1 row when I need it to return 2, and it leaves the values in split_me unchanged, whereas I need to duplicate the rows and allocate 1 to split_me where the value is greater than 1, otherwise allocate the remaining fraction.
I can't think of a way to do this without getting loopy and complicated. My best approach is to ceil(split_me) and then run repeat, but I still need a way to allocate the values to the duplicated rows. Looking for a simpler solution if anyone has one.
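For reference, the ceil-then-repeat idea described in the question can be sketched end to end. The step that reallocates the values on the repeated rows (the frac/last names, and the duplicated(keep='last') mask) is my own filling-in of the missing piece, not taken from any answer below:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})

# Repeat each row ceil(split_me) times, then reassign split_me:
# every repeat gets 1.0, and the last repeat of a fractional value
# gets the remainder instead.
out = sample.loc[sample.index.repeat(np.ceil(sample['split_me']).astype(int))].copy()
frac = out['split_me'] % 1
last = ~out.index.duplicated(keep='last')  # last repeat of each original row
out['split_me'] = np.where(last & frac.ne(0), frac, 1.0)
out = out.reset_index(drop=True)
```

Note the modulo leaves float noise (3.2 % 1 is 0.20000000000000018), so round if exact display values matter.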
CodePudding user response:
Use a custom repeat function:
repeat_float = lambda x: [1] * int(x // 1) + ([x % 1] if x % 1 != 0 else [])
out = sample['split_me'].apply(repeat_float).explode().to_frame().join(sample['copy_me'])
Output:
>>> out.reset_index(drop=True)
   split_me copy_me
0         1       A
1       0.5       A
2         1       B
3         1       B
4         1       C
5         1       C
6         1       C
7         1       C
8         1       D
9         1       D
10        1       D
11      0.2       D
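One caveat worth knowing: explode returns an object-dtype column (which is why the output above prints 1 rather than 1.0), so cast back if you need a numeric split_me. A self-contained variant of the answer above with that cast added (the astype(float) line is my addition):

```python
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})

# int(x // 1) ones, plus the fractional remainder when there is one.
repeat_float = lambda x: [1] * int(x // 1) + ([x % 1] if x % 1 != 0 else [])

out = (sample['split_me'].apply(repeat_float)  # one list per row
       .explode()                              # one row per list element
       .astype(float)                          # explode yields object dtype
       .to_frame()
       .join(sample['copy_me'])                # align on the repeated index
       .reset_index(drop=True))
```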
CodePudding user response:
Try:
import pandas as pd
import numpy as np
sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})
res = pd.concat([pd.Series(c, index=np.arange(1.0, s, 1.0).tolist() + np.array([s % 1]).tolist()) for s, c in zip(sample["split_me"], sample["copy_me"])])
print(res)
Output
1.0 A
0.5 A
1.0 B
0.0 B
1.0 C
2.0 C
3.0 C
0.0 C
1.0 D
2.0 D
3.0 D
0.2 D
dtype: object
Or cleaner:
def expanded_index(s, c):
    index = np.append(np.arange(1.0, s, 1.0), [s % 1])
    return pd.Series(c, index)
res = pd.concat([expanded_index(s, c) for s, c in zip(sample["split_me"], sample["copy_me"])])
print(res)
CodePudding user response:
Yes, we can do this:
out = sample.reindex(sample.index.repeat(np.ceil(sample['split_me']).astype(int)))
out['new'] = 1
con = ~out['copy_me'].duplicated(keep='last') & (out['split_me']%1!=0)
out['new'] = out['new'].mask(con, out['split_me']%1)
out
Out[195]:
   split_me copy_me  new
0       1.5       A  1.0
0       1.5       A  0.5
1       2.0       B  1.0
1       2.0       B  1.0
2       4.0       C  1.0
2       4.0       C  1.0
2       4.0       C  1.0
2       4.0       C  1.0
3       3.2       D  1.0
3       3.2       D  1.0
3       3.2       D  1.0
3       3.2       D  0.2
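If you want the remainder written into split_me itself rather than kept in a separate new column, the same duplicated/modulo condition can feed np.where directly. This in-place variant is my adaptation of the answer above, not part of it:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})

out = sample.reindex(sample.index.repeat(np.ceil(sample['split_me']).astype(int)))

# Last row of each group, and only where the original value was fractional.
con = ~out['copy_me'].duplicated(keep='last') & (out['split_me'] % 1 != 0)

# Overwrite split_me directly instead of adding a helper column.
out['split_me'] = np.where(con, out['split_me'] % 1, 1.0)
out = out.reset_index(drop=True)
```

The duplicated trick relies on copy_me being unique per input row, as it is in the sample data; with repeated labels you would key off the index instead.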
CodePudding user response:
We could use np.modf to separate the fractional and integral parts of split_me, then scale up based only on the integral portions, append the non-zero fractional parts, sort_index (with a stable sort) to get the expected order, and lastly join back the columns and reset_index to restore the range index:
fractional, integral = np.modf(sample['split_me'])
df = (
    integral.loc[integral.index.repeat(integral.astype(int))]
    .append(fractional[fractional.ne(0)])
    .sort_index(kind='stable')
    .to_frame()
    .join(sample[['copy_me']])
    .reset_index(drop=True)
)
df:
    split_me copy_me
0        1.0       A
1        0.5       A
2        2.0       B
3        2.0       B
4        4.0       C
5        4.0       C
6        4.0       C
7        4.0       C
8        3.0       D
9        3.0       D
10       3.0       D
11       0.2       D
Setup and imports:
import numpy as np
import pandas as pd
sample = pd.DataFrame({
    'split_me': [1.5, 2, 4, 3.2],
    'copy_me': ['A', 'B', 'C', 'D']
})
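As the output above shows, repeating the integral values copies 2.0, 3.0 and 4.0 into the repeated rows. To get 1.0 per whole unit, as in the question's desired output, one option is to repeat a series of ones instead; this tweak, and the use of pd.concat in place of the (since-removed) Series.append, are my adjustments rather than part of the answer:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'split_me': [1.5, 2, 4, 3.2], 'copy_me': ['A', 'B', 'C', 'D']})

fractional, integral = np.modf(sample['split_me'])

# One row of 1.0 per whole unit, instead of repeating the integral value itself.
ones = pd.Series(1.0, index=sample.index, name='split_me')
ones = ones.loc[ones.index.repeat(integral.astype(int))]

df = (
    pd.concat([ones, fractional[fractional.ne(0)]])  # add non-zero remainders
    .sort_index(kind='stable')                       # remainders after the ones
    .to_frame()
    .join(sample[['copy_me']])
    .reset_index(drop=True)
)
```

The stable sort matters: it keeps each group's 1.0 rows ahead of its fractional remainder within the same index label.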