I have a dataframe (much larger than this example) as follows, where each row in the first two columns is repeated 5 times.
import pandas as pd
df = pd.DataFrame({'text':['the weather is nice','the weather is nice','the weather is nice','the weather is nice','the weather is nice',
'the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful',
'the day is long','the day is long','the day is long','the day is long','the day is long'],
'reference':['weather','weather','weather','weather','weather',
'house','house','house','house','house',
'day','day','day','day','day'],
'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
I would like to split this pandas dataframe into two dataframes so that, for each block of five repeated rows, the first two consecutive rows go into the first dataframe and the remaining three go into the second, as follows.
The desired output:
first df:
text reference id
0 the weather is nice weather 1
1 the weather is nice weather 2
3 the house is beautiful house 6
4 the house is beautiful house 7
5 the day is long day 11
6 the day is long day 12
second df:
text reference id
0 the weather is nice weather 3
1 the weather is nice weather 4
2 the weather is nice weather 5
3 the house is beautiful house 8
4 the house is beautiful house 9
5 the house is beautiful house 10
6 the day is long day 13
7 the day is long day 14
8 the day is long day 15
Obviously, selecting every n-th row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I would like to know how to produce the output shown above.
CodePudding user response:
If you want to group on the value of reference (first 2 items vs rest):
mask = df.groupby('reference').cumcount().gt(1)
groups = [g for k,g in df.groupby(mask)]
# or manually
# df1 = df[~mask]
# df2 = df[mask]
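As a rough check of the cumcount approach, here is a minimal sketch on a trimmed-down stand-in for the question's frame (only the reference and id columns are kept, which is an assumption made for brevity). cumcount numbers the rows 0 to 4 inside each reference group, so gt(1) marks the last three rows of every group:

```python
import pandas as pd

# Trimmed stand-in for the question's frame: each reference repeated 5 times.
df = pd.DataFrame({
    'reference': ['weather'] * 5 + ['house'] * 5 + ['day'] * 5,
    'id': range(1, 16),
})

# Position of each row within its 'reference' group (0..4);
# > 1 is True for the 3rd, 4th and 5th row of every group.
mask = df.groupby('reference').cumcount().gt(1)

df1 = df[~mask]  # first two rows of each group
df2 = df[mask]   # remaining three rows of each group

first_ids = df1['id'].tolist()
second_ids = df2['id'].tolist()
print(first_ids)   # [1, 2, 6, 7, 11, 12]
print(second_ids)  # [3, 4, 5, 8, 9, 10, 13, 14, 15]
```

Because the mask is aligned on the original index, this works even when the groups have different sizes or the rows of a group are not adjacent.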
Using position:
import numpy as np
mask = (np.arange(len(df)) % 5) > 1
# or with a range index
# mask = (df.index % 5) > 1
# then same as above using groupby or slicing
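The positional variant can be sketched the same way. The example below again uses a trimmed-down stand-in frame (an assumption, not the asker's full data) and adds reset_index(drop=True), which is optional and only there so that each result is renumbered from 0:

```python
import numpy as np
import pandas as pd

# Trimmed stand-in for the question's frame.
df = pd.DataFrame({
    'reference': ['weather'] * 5 + ['house'] * 5 + ['day'] * 5,
    'id': range(1, 16),
})

# Position of each row inside its block of 5 consecutive rows: 0,1,2,3,4,0,1,...
pos = np.arange(len(df)) % 5
mask = pos > 1  # True for the last three rows of every block

# reset_index(drop=True) renumbers the rows of each result from 0.
df1 = df[~mask].reset_index(drop=True)
df2 = df[mask].reset_index(drop=True)

print(df1['id'].tolist())  # [1, 2, 6, 7, 11, 12]
print(df2['id'].tolist())  # [3, 4, 5, 8, 9, 10, 13, 14, 15]
```

This version depends only on row order, so it assumes every block really has exactly 5 consecutive rows.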
CodePudding user response:
Make a mask m:
import numpy as np
m = np.tile([True, True, False, False, False], len(df) // 5)
df1 = df[m]
df2 = df[~m]
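Note that np.tile builds a mask of exactly (len(df) // 5) * 5 entries, so it only lines up with df when the length is a multiple of 5. One possible way to make that assumption explicit is a small wrapper; split_head_tail, block_size and head are hypothetical names introduced here for illustration, not part of pandas or NumPy:

```python
import numpy as np
import pandas as pd

def split_head_tail(df, block_size=5, head=2):
    """Hypothetical helper: return (first `head` rows of every block of
    `block_size` consecutive rows, the remaining rows of every block)."""
    if len(df) % block_size != 0:
        raise ValueError("len(df) must be a multiple of block_size")
    # One block's pattern, e.g. [True, True, False, False, False].
    pattern = [True] * head + [False] * (block_size - head)
    # Repeat the pattern once per block so the mask matches len(df).
    m = np.tile(pattern, len(df) // block_size)
    return df[m], df[~m]

# Small illustrative frame with two blocks of five rows.
df = pd.DataFrame({'id': range(1, 11)})
df1, df2 = split_head_tail(df)
print(df1['id'].tolist())  # [1, 2, 6, 7]
print(df2['id'].tolist())  # [3, 4, 5, 8, 9, 10]
```

Raising early on a mismatched length avoids a less obvious indexing error from the boolean mask later on.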