Is there a way to group by in pandas, but only keep runs with a specific number of values?

Time:04-06

I'm having some trouble with my DataFrame, shown below. I am trying to group by the first column, joining one column as "start-end" ranges separated by "-" and the other simply with \n. The problem is that I only want to keep a range if it contains a certain number of consecutive values (minimum 4).

   a      b  c
0  a  Num_1  0
1  a  Num_1  1
2  a  Num_1  2
3  a  Num_2  5
4  a  Num_2  6
5  a  Num_2  7
6  a  Num_2  8
7  a  Num_2  9

I wrote the following code:

def split_by_threshold(li):
    inds = [0] + [ind for ind,(i,j) in enumerate(zip(li,li[1:]),1) if j-i != 1] + [len(li) + 1]
    rez = [li[i:j] for i,j in zip(inds,inds[1:])]
    return rez

def dropst(serie):
    serie = serie.to_numpy().tolist()
    serie = list(dict.fromkeys(serie))
    return '\n'.join(serie)

def joining_(series):
    series = series.to_numpy().tolist()
    if series:
        split_li = split_by_threshold(series)
        a=[]
        for x in split_li:
            if x[-1]-x[0]:
                a.append(str(x[0]) + '-' + str(x[-1]))
        return '\n'.join(a)
    else:
        return 'None'

col_1, col_2, col_3 = d.columns
final = d.groupby([col_1], as_index = False).agg(
    {   col_1: 'first',
        col_2: dropst,
        col_3: joining_}
)

print(final)

The result I get is:

   a             b         c
0  a  Num_1\nNum_2  0-2\n5-9

but I need it to be:

   a   b      c
0  a   Num_2  5-9

CodePudding user response:

IIUC, you can groupby a, b, and an extra key that identifies runs of consecutive values. Then agg with a custom function:

def join(s, thresh=4):
    MIN = s.min()
    MAX = s.max()
    return f'{MIN}-{MAX}' if MAX-MIN >= thresh else float('nan')

consecutive = df['c'].diff().ne(1).cumsum()
# could also be
# df.groupby(['a','b'])['c'].diff().ne(1).cumsum()
# but not required as we anyway group by those later

(df
 .groupby(['a', 'b', consecutive], as_index=False)
 ['c']
 .agg(join, thresh=4)
 .dropna(subset='c')
 )

Output:

   a      b    c
2  a  Num_2  5-9
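
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample frame from the question is rebuilt as `df`. The helper name `summarize_runs` and its `min_count` parameter are made up for illustration and are not part of pandas. Note that it filters on the number of values in each run, whereas `join` above compares MAX-MIN with `thresh`, so the two thresholds are not interchangeable.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed to match the asker's data).
df = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})

# A new run starts wherever the step from the previous value is not 1,
# so the cumulative sum gives every run of consecutive values its own label.
run_id = df["c"].diff().ne(1).cumsum().rename("run")
print(run_id.tolist())


def summarize_runs(frame, min_count=4):
    """Hypothetical helper: keep runs holding at least `min_count` values."""
    run = frame["c"].diff().ne(1).cumsum().rename("run")
    stats = (
        frame.groupby(["a", "b", run])["c"]
        .agg(start="min", end="max", n="count")  # named aggregation
        .reset_index()
    )
    kept = stats[stats["n"] >= min_count]
    # Build the "start-end" label only for the runs that survived the filter.
    label = kept["start"].astype(str) + "-" + kept["end"].astype(str)
    return kept.assign(c=label)[["a", "b", "c"]].reset_index(drop=True)


print(summarize_runs(df))
```

With the sample data, the Num_1 run has only three values and is dropped, while the Num_2 run (5 through 9) is kept. Lower `min_count` if shorter runs should survive.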