Count group by where value is in interval


I am a bit lost here on how to find an easy solution to this in Python Pandas.

I have a dataframe with 3 columns:

A  B  val
P1 P2 12
P1 P2 14
P2 P2 18
P2 P1 17
P1 P3 15
P1 P3 16
P1 P3 13

I want to count, grouped by A and B, how many values fall in specific intervals, manually defined in another dataframe:

MIN MAX
12  12
13  15
16  17

The result should be the count per interval, plus the rest, presented like this:

A  B  V_12_12 V_13_15 V_16_17 V_OTHERS
P1 P2 1       1       0       0        
P2 P2 0       0       0       1
P2 P1 0       0       1       0       
P1 P3 0       2       1       0

I want the result to be dynamic: if I change the intervals, or remove or add some, the column names and the number of columns in the final dataframe should change accordingly.
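
For reference, a minimal sketch to reproduce the two frames above (assuming pandas; the name intervals for the second frame is just for illustration):

import pandas as pd

# The input data shown above
df = pd.DataFrame({
    'A':   ['P1', 'P1', 'P2', 'P2', 'P1', 'P1', 'P1'],
    'B':   ['P2', 'P2', 'P2', 'P1', 'P3', 'P3', 'P3'],
    'val': [12, 14, 18, 17, 15, 16, 13],
})

# The manually defined intervals
intervals = pd.DataFrame({'MIN': [12, 13, 16], 'MAX': [12, 15, 17]})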

Thanks for the help.

CodePudding user response:

Try something like this using pd.cut:

import numpy as np
import pandas as pd

df = pd.read_clipboard()    # the A / B / val frame
df2 = pd.read_clipboard()   # the MIN / MAX intervals frame

df['labels'] = pd.cut(df['val'],
                      bins=[0] + df2['MAX'].tolist() + [np.inf],
                      labels=[f'V_{s}_{e}' for s, e in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS'])

df.groupby(['A','B','labels'])['labels'].count().unstack().reset_index()

Output:

labels   A   B  V_12_12  V_13_15  V_16_17  V_OTHERS
0       P1  P1        0        0        0         0
1       P1  P2        1        1        0         0
2       P1  P3        0        2        1         0
3       P2  P1        0        0        1         0
4       P2  P2        0        0        0         1
5       P2  P3        0        0        0         0
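
If the intervals change often, the same pd.cut logic can be wrapped in a small helper so the columns always follow df2 — a sketch, where count_by_intervals is just an illustrative name, assuming df holds A, B, val and df2 holds the MIN/MAX rows:

import numpy as np
import pandas as pd

def count_by_intervals(df, df2):
    # Bin edges run from 0 through each MAX, plus an open-ended catch-all bin.
    # Like the answer above, this assumes the intervals are contiguous over the
    # integer values, so each MAX doubles as the boundary to the next interval.
    bins = [0] + df2['MAX'].tolist() + [np.inf]
    labels = [f'V_{s}_{e}' for s, e in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS']
    labelled = pd.cut(df['val'], bins=bins, labels=labels)
    return (df.assign(labels=labelled)
              .groupby(['A', 'B', 'labels'], observed=False)['labels']
              .count().unstack().reset_index())

Editing, adding, or removing rows in df2 then changes the V_* columns automatically.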

CodePudding user response:

Calling the second dataframe (the one with MIN and MAX) limits below:

import numpy as np
import pandas as pd

diffs = np.subtract.outer(df["val"].to_numpy(),
                          limits.to_numpy()).reshape(len(df), -1)
from_min, from_max = diffs[:, ::2], diffs[:, 1::2]

counts = (pd.DataFrame((from_min >= 0) & (from_max <= 0))
            .groupby([df["A"], df["B"]], sort=False).sum())

counts.columns = limits.astype(str).agg("_".join, axis=1).radd("V_")

counts["V_OTHERS"] = df.groupby(["A", "B"]).count().sub(counts.sum(axis=1), axis=0)

counts = counts.reset_index()
  • get the "cross" differences of the "val" column values against each pair of min & max limits

    • that outer subtraction gives a shape of "(len(df), *limits.shape)"
    • flatten the last 2 dimensions to make it 2D, so the differences can be added as extra columns (a small shape check follows the output below)
  • split the differences into from_min and from_max

    • check whether a value falls within a range: greater than or equal to the minimum, less than or equal to the maximum
  • group these by "A" and "B" and sum those True/False values to count

  • pull out the names of the new columns from the contents of limits

    • row-wise aggregation joining with "_", then add "V_" from the right
  • lastly compute the remainders

    • count how many rows each A & B pair has, and subtract the already computed interval counts from that
  • and reset the index to move the groupers back to columns

to get

>>> counts
    A   B  V_12_12  V_13_15  V_16_17  V_OTHERS
0  P1  P2        1        1        0         0
1  P2  P2        0        0        0         1
2  P2  P1        0        0        1         0
3  P1  P3        0        2        1         0
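
To make the shapes concrete, a small check on the sample data (assuming df and limits as above; the intermediate names are just for illustration):

import numpy as np

vals = df["val"].to_numpy()               # shape (7,)
lims = limits.to_numpy()                  # shape (3, 2): one row per MIN/MAX pair
diffs = np.subtract.outer(vals, lims)     # shape (7, 3, 2): val - MIN and val - MAX for each interval
flat = diffs.reshape(len(df), -1)         # shape (7, 6): MIN/MAX differences interleaved per interval
in_range = (flat[:, ::2] >= 0) & (flat[:, 1::2] <= 0)   # shape (7, 3): True where MIN <= val <= MAX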

CodePudding user response:

Try this:

import pandas as pd

def find_group(val):
  if 12 <= val <= 12:
    return "V_12_12"
  elif 13 <= val <= 15:
    return "V_13_15"
  elif 16 <= val <= 17:
    return "V_16_17"
  else:
    return "V_OTHERS"

df = pd.DataFrame({
    'A':['P1','P1','P2','P2','P1','P1','P1'],
    'B':['P2','P2','P2','P1','P3','P3','P3'],
    'val':[12,14,18,17,15,16,13]
})
df["group"]=df["val"].apply(find_group)

result=df.groupby(["A","B","group"]).count().unstack(fill_value=0).stack()

result.unstack()
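
The ranges in find_group are hardcoded; if you want this approach to follow the intervals dataframe instead (to keep it dynamic, as asked), a possible sketch — make_find_group is a hypothetical helper, not part of the answer above, and it assumes an intervals frame with MIN and MAX columns:

def make_find_group(intervals):
    # Build the classifier from the MIN/MAX rows instead of hardcoding them.
    bounds = list(zip(intervals["MIN"], intervals["MAX"]))
    def find_group(val):
        for lo, hi in bounds:
            if lo <= val <= hi:
                return f"V_{lo}_{hi}"
        return "V_OTHERS"
    return find_group

intervals = pd.DataFrame({"MIN": [12, 13, 16], "MAX": [12, 15, 17]})
df["group"] = df["val"].apply(make_find_group(intervals))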

CodePudding user response:

You can use a pd.IntervalIndex built from your MIN and MAX columns to cut the values before grouping:

import pandas as pd

# Your data here
df = pd.DataFrame({'A': {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P2', 4: 'P1', 5: 'P1', 6: 'P1'}, 'B': {0: 'P2', 1: 'P2', 2: 'P2', 3: 'P1', 4: 'P3', 5: 'P3', 6: 'P3'}, 'val': {0: 12, 1: 14, 2: 18, 3: 17, 4: 15, 5: 16, 6: 13}})
intervals = pd.DataFrame({'MIN': {0: 12, 1: 13, 2: 16}, 'MAX': {0: 12, 1: 15, 2: 17}})

idx = pd.IntervalIndex.from_arrays(intervals["MIN"], intervals["MAX"] , closed="both")

intervals = pd.cut(df["val"], idx)
groups = [df["A"], df["B"]]
renamer = lambda x: f"V_{x.left}_{x.right}" if isinstance(x, pd.Interval) else x

out = pd.concat([
    intervals.groupby(groups).value_counts().unstack(),      # This handles all values within some interval
    intervals.isna().groupby(groups).agg(V_OTHERS="sum")     # This handles the V_OTHERS column
], axis=1).rename(columns=renamer).reset_index()

out:

    A   B  V_12_12  V_13_15  V_16_17  V_OTHERS
0  P1  P2        1        1        0         0
1  P1  P3        0        2        1         0
2  P2  P1        0        0        1         0
3  P2  P2        0        0        0         1
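
As a quick check of the cut step (a sketch reusing the sample values), a value outside every interval comes back as NaN, which is what feeds the V_OTHERS column:

idx = pd.IntervalIndex.from_arrays([12, 13, 16], [12, 15, 17], closed="both")
pd.cut(pd.Series([12, 14, 18, 17, 15, 16, 13]), idx)
# maps each value to its interval ([12, 12], [13, 15], ...) and to NaN for 18,
# which falls outside all three intervals

closed="both" makes both endpoints inclusive, matching the MIN/MAX definition, and because the column names are derived from the intervals frame, adding or removing rows there changes the output columns automatically.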