I am a bit lost here on how to have an easy the solution in Python Pandas
I have a dataframe with 3 columns:
A B val
P1 P2 12
P1 P2 14
P2 P2 18
P2 P1 17
P1 P3 15
P1 P3 16
P1 P3 13
I want to count group by A and B, value in specifics intervalls, manually defined in another dataframe:
MIN MAX
12 12
13 15
16 17
The result should be the count number on the intervall and rest as presented:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
P1 P2 1 1 0 0
P2 P2 0 0 0 1
P2 P1 0 0 1 0
P1 P3 0 2 1 0
I want to have the result dynamically, if I change intervalls, remove or add other it should change column names or number in the final dataframe.
Thanks for help.
CodePudding user response:
Try something like this using pd.cut
:
df = pd.read_clipboard()
df2 = pd.read_clipboard()
df['labels']=pd.cut(df['val'],
bins=[0] df2['MAX'].tolist() [np.inf],
labels = [f'V_{s}_{e}' for s, e in zip(df2['MIN'], df2['MAX'])] ['V_OTHERS'])
df.groupby(['A','B','labels'])['labels'].count().unstack().reset_index()
Output:
labels A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P1 0 0 0 0
1 P1 P2 1 1 0 0
2 P1 P3 0 2 1 0
3 P2 P1 0 0 1 0
4 P2 P2 0 0 0 1
5 P2 P3 0 0 0 0
CodePudding user response:
Calling the second dataframe limits
below:
diffs = np.subtract.outer(df["val"].to_numpy(),
limits.to_numpy()).reshape(len(df), -1)
from_min, from_max = diffs[:, ::2], diffs[:, 1::2]
counts = (pd.DataFrame((from_min >= 0) & (from_max <= 0))
.groupby([df["A"], df["B"]], sort=False).sum())
counts.columns = limits.astype(str).agg("_".join, axis=1).radd("V_")
counts["V_OTHERS"] = df.groupby(["A", "B"]).count().sub(counts.sum(axis=1), axis=0)
counts = counts.reset_index()
get the "cross" differences of "val" column values against the min & max limits each
- that outer subtraction will give a shape "(len(df), *limits.shape)"
- make it flattened in the last 2 dimensions to make it 2D to add as more columns
differentiate differences from_min and from_max
- check if a value falls in between the ranges: greater than minimum, less than maximum
group these by "A" and "B" and sum those True/False's to count
pull out the names of the new columns from the contents of limits
- row wise aggregation with "_" joining, and add from right "V_"
lastly compute the remainders
- see however A & B pairs there are, and subtract the aforecomputed counts from them
and reset the index to move groupers to columns
to get
>>> counts
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P2 P2 0 0 0 1
2 P2 P1 0 0 1 0
3 P1 P3 0 2 1 0
CodePudding user response:
Try this:
def find_group(val):
if 12 <= val <= 12:
return "V_12_12"
elif 13 <= val <= 15:
return "V_13_15"
elif 16 <= val <= 17:
return "V_16_17"
else:
return "V_OTHERS"
df = pd.DataFrame({
'A':['P1','P1','P2','P2','P1','P1','P1'],
'B':['P2','P2','P2','P1','P3','P3','P3'],
'val':[12,14,18,17,15,16,13]
})
df["group"]=df["val"].apply(find_group)
result=df.groupby(["A","B","group"]).count().unstack(fill_value=0).stack()
result.unstack()
CodePudding user response:
You can use a pd.IntervalIndex
built from your MIN and MAX columns to cut the values before grouping:
import pandas as pd
# Your data here
df = pd.DataFrame({'A': {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P2', 4: 'P1', 5: 'P1', 6: 'P1'}, 'B': {0: 'P2', 1: 'P2', 2: 'P2', 3: 'P1', 4: 'P3', 5: 'P3', 6: 'P3'}, 'val': {0: 12, 1: 14, 2: 18, 3: 17, 4: 15, 5: 16, 6: 13}})
intervals = pd.DataFrame({'MIN': {0: 12, 1: 13, 2: 16}, 'MAX': {0: 12, 1: 15, 2: 17}})
idx = pd.IntervalIndex.from_arrays(intervals["MIN"], intervals["MAX"] , closed="both")
intervals = pd.cut(df["val"], idx)
groups = [df["A"], df["B"]]
renamer = lambda x: f"V_{x.left}_{x.right}" if isinstance(x, pd.Interval) else x
out = pd.concat([
intervals.groupby(groups).value_counts().unstack(), # This handles all values within some interval
intervals.isna().groupby(groups).agg(V_OTHERS="sum") # This handles the V_OTHERS column
], axis=1).rename(columns=renamer).reset_index()
out:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1