I'm having some trouble with a DataFrame of mine, shown below. I am trying to group by the first column, with one column's values collapsed into ranges separated by "-" and the other simply joined by \n. The problem is that I need a certain amount of consecutive numbers in a row (minimum 4) for a range to be kept.
   a      b  c
0  a  Num_1  0
1  a  Num_1  1
2  a  Num_1  2
3  a  Num_2  5
4  a  Num_2  6
5  a  Num_2  7
6  a  Num_2  8
7  a  Num_2  9
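For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above. It assumes the frame lives in a variable named d (the name used in the groupby call further down) and that column c holds plain integers:

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: `c` is an integer column and the frame is called `d`.
d = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})
print(d)
```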
And I wrote the following code:
def split_by_threshold(li):
    # indices where a new run of consecutive values starts
    inds = [0] + [ind for ind,(i,j) in enumerate(zip(li,li[1:]),1) if j-i != 1] + [len(li) + 1]
    rez = [li[i:j] for i,j in zip(inds,inds[1:])]
    return rez

def dropst(serie):
    serie = serie.to_numpy().tolist()
    # drop duplicates while keeping the original order
    serie = list(dict.fromkeys(serie))
    return '\n'.join(serie)

def joining_(series):
    series = series.to_numpy().tolist()
    if series:
        split_li = split_by_threshold(series)
        a=[]
        for x in split_li:
            if x[-1]-x[0]:
                a.append(str(x[0]) + '-' + str(x[-1]))
        return '\n'.join(a)
    else:
        return 'None'

col_1, col_2, col_3 = d.columns
final = d.groupby([col_1], as_index = False).agg(
    { col_1: 'first',
      col_2: dropst,
      col_3: joining_}
)
print(final)
The output I get is:
   a             b         c
0  a  Num_1\nNum_2  0-2\n5-9
and what I need is:
   a      b    c
0  a  Num_2  5-9
CodePudding user response:
IIUC, you can groupby a, b, and an additional grouper that identifies runs of consecutive values. Then agg with a custom function:
def join(s, thresh=4):
    MIN = s.min()
    MAX = s.max()
    return f'{MIN}-{MAX}' if MAX-MIN >= thresh else float('nan')

consecutive = df['c'].diff().ne(1).cumsum()
# could also be
# df.groupby(['a','b'])['c'].diff().ne(1).cumsum()
# but not required as we anyway group by those later

(df
 .groupby(['a', 'b', consecutive], as_index=False)
 ['c']
 .agg(join, thresh=4)
 .dropna(subset='c')
)
Output:
   a      b    c
2  a  Num_2  5-9
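To make the consecutive grouper less of a black box, here is a self-contained sketch on the sample data. It is a variant of the code above, not the same code: the run label is added as a helper column called run (a name made up for this example) so it cannot clash with column c, and the min/max are computed with named aggregation before filtering on the threshold. The names lo, hi, runs, and thresh are likewise illustrative.

```python
import pandas as pd

# Sample data from the question (assumed integer column `c`).
df = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})

thresh = 4  # same threshold as in the answer: keep runs where max - min >= 4

# A new run starts wherever the step from the previous value is not exactly 1.
# diff() is NaN on the first row, and NaN != 1, so the first row opens run 1.
run_id = df["c"].diff().ne(1).cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 2, 2, 2]

# Collapse each (a, b, run) group to its min and max.
runs = (
    df.assign(run=run_id)
      .groupby(["a", "b", "run"], as_index=False)
      .agg(lo=("c", "min"), hi=("c", "max"))
)

# Keep only the runs that span at least `thresh`, then format them as "lo-hi".
keep = runs[runs["hi"] - runs["lo"] >= thresh]
result = (
    keep.assign(c=keep["lo"].astype(str) + "-" + keep["hi"].astype(str))
        [["a", "b", "c"]]
        .reset_index(drop=True)
)
print(result)
```

Note that with thresh=4 the test is on max - min, so a run needs five consecutive values to pass; if "minimum 4 numbers in a row" is meant literally, thresh=3 would be the value to use.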