Home > Enterprise >  Divide into groups according to the specified attribute
Divide into groups according to the specified attribute

Time:06-24

I need to group the data in such a way that if the difference between the adjacent values from column a1 was equal to the same pre-specified value, then they belong to the same group. If the value between two adjacent elements is different, then all subsequent data belong to a different group. For example, I have such a data table

import pandas as pd
import numpy as np

data = [
    [5, 2],
    [100, 23],
    [101, -2],
    [303, 9],
    [304, 4],
    [709, 14],
    [710, 3],
    [711, 3],
    [988, 21]
]
columns = ['a1', 'a2']
df = pd.DataFrame(data=data, columns=columns)

If the difference between the elements of column a1 is equal to one, then they belong to the same group and the answer in this example will be the following:

[[0], [1, 2], [3, 4], [5, 6, 7], [8]]

The output list stores indexes that correspond to rows from df.

It may also be useful that column a1 is ordered. Thank you for your help!

CodePudding user response:

Assuming that your data frame is sorted by a1 and that I understood your problem correctly, I think you could do something like this:

import pandas as pd
import numpy as np
from numba import njit

data = [
    [5, 2],
    [100, 23],
    [101, -2],
    [303, 9],
    [304, 4],
    [709, 14],
    [710, 3],
    [711, 3],
    [988, 21]
]
columns = ['a1', 'a2']
df = pd.DataFrame(data=data, columns=columns)

@njit
def get_groups(vals):
    counter = 0
    group = []
    for i in range(len(vals)-1):
        if vals[i 1]-vals[i] == 1:
            group.append(counter)
        else:
            group.append(counter)
            counter  = 1
    if vals[-1] - vals[-2] == 1: group.append(group[-1])
    else: group.append(counter   1)
        
    return group  
    
groups = get_groups(df['a1'].values)
assert len(groups) == len(df)

df['group'] = groups
final_ls = df.reset_index().groupby(['group']).agg({'index': list})['index'].to_list()
final_ls

------------------------------------------------------------
[[0], [1, 2], [3, 4], [5, 6, 7], [8]]
------------------------------------------------------------

The njit decorator from numba makes the looping approach efficient.

CodePudding user response:

We are sorting the the Dataframe by "a1" column, then finding the difference of adjacent values. Now we have the difference, we can start grouping.

import pandas as pd

data = [
    [5, 2],
    [100, 23],
    [101, -2],
    [303, 9],
    [304, 4],
    [709, 14],
    [710, 3],
    [711, 3],
    [988, 21]
]
columns = ['a1', 'a2']
df = pd.DataFrame(data=data, columns=columns)

# To sort the values of "a1" column
df=df.sort_values(by=['a1'])
# To find the difference between the adjacent values
df['difference']=df['a1'].diff()

# Sorted index
indexs=df.index.tolist()

group=[]

# To check the difference, before this row
check=-1
for i,diff in enumerate(df['difference']):
    if diff==0 or diff==1:
        if check==1:
            group[-1].append(indexs[i])
        else:
            group.append([indexs[i-1],indexs[i]])
    check=diff

# For finding indexes that are not in group
z=[]
for x in group: [z.append(w) for w in x]
for t in (set(indexs)-set(z)):
    group.append([t])
print(group)

Result:

[[1, 2], [3, 4], [5, 6, 7], [0], [8]]
  • Related