there is a column with 4 categories, I want to display the frequency of occurrence of values of other columns for each unique value example of partial Data
Output
CodePudding user response:
Starting from this:
df = pd.DataFrame(
{
"cat1": ["yes", "no", "yes", "no", "yes"],
"cat2": ["a", "a", "b", "b", "a"],
"cat3": ["yes", "no", "no", "yes", "no"],
"quant": [1, 2, 3, 4, 5],
}
)
Sample DataFrame:
cat1 cat2 cat3 quant
0 yes a yes 1
1 no a no 2
2 yes b no 3
3 no b yes 4
4 yes a no 5
You can do:
y = lambda x: x.value_counts(normalize=True).loc["yes"]
n = lambda x: x.value_counts(normalize=True).loc["no"]
df.groupby(["cat2"]).agg(
{
"cat1": [("yes", y), ("no", n)],
"cat3": [("yes", y), ("no", n)],
"quant": ["min", "max", "mean"],
}
)
Result:
cat1 cat3 quant
yes no yes no min max mean
cat2
a 0.666667 0.333333 0.333333 0.666667 1 5 2.666667
b 0.500000 0.500000 0.500000 0.500000 3 4 3.500000
Here's a slightly more robust version:
from functools import partial
def agg_func(s: pd.Series, name: str):
try:
return s.value_counts(normalize=True).loc[name]
except KeyError:
return 0
yes_no_agg = [
("yes", partial(agg_func, name="yes")),
("no", partial(agg_func, name="no")),
]
df.groupby(["cat2"]).agg(
{
"cat1": yes_no_agg,
"cat3": yes_no_agg,
"quant": ["min", "max", "mean"],
}
)
CodePudding user response:
How to generate descriptive information by category
I propose to separate two cases. One is numerical data to see their min, max and average by category. The other is labeled data and their frequency distribution.
Min, max, mean of numerical data by category
First, let's prepare the data with 'mode'
as a category column and two numerical fields 'Measure1', 'Measure2'
:
import pandas as pd
data = {'mode': ['i','i','i','ii','ii','iii','ii','iii','ii'],
'Ofloxacin': ['no','no','no','yes','no','no','yes','no','no'],
'ChangeMode': ['yes','no','no','yes','no','yes','yes','no','yes'],
'Measure1': [*range(1, 10)],
'Measure2': [*range(-9, 0)]}
df = pd.DataFrame(data)
In this case, I propose to use DataFrame.describe
to generate statistics for numerical data:
df.groupby('mode').agg('describe')
In the output we can find min, max, mean
:
Labeled data and their frequency distribution by category
1. The same labels in different columns
In this particular case we can see that columns 'Ofloxacin','ChangeMode'
are labeled the same. For that reason, let's use DataFrame(...).stack
to form rows with label distribution data by categories and fields:
columns = ['Ofloxacin','ChangeMode']
df.groupby('mode')[columns].apply(lambda group: (
pd.DataFrame((group.value_counts(col, True) for col in columns), index=columns)
.stack(dropna=False)
.fillna(0)
.sort_index()
)
)
Here's the output on the same data as in the previous case:
2. Different labeling in different columns.
Let's add new columns with slightly more varied content:
data = {'mode': ['i','i','i','ii','ii','iii','ii','iii','ii'],
'Ofloxacin': ['no','no','no','yes','no','no','yes','no','no'],
'ChangeMode': ['yes','no','no','yes','no','yes','yes','no','yes'],
'Reaction': ['good','bad','so-so','good','good','so-so','bad','bad','good'],
'DummyData': 'hello my world and all the people out there'.split()}
df = pd.DataFrame(data)
In this case, the previous solution won't work. To fix it, let's prepare a multi index for expected columns:
from functools import reduce
columns = ['Ofloxacin','ChangeMode','Reaction','DummyData']
multicol = pd.MultiIndex.from_tuples(
reduce(
lambda x, y: x.append(y),
(
pd.Index(sorted(df[col].unique()))
.map(lambda label: (col, label))
for col in columns
),
pd.Index([])
)
)
I'm gonna use it to store data in the right place like this:
final_output = df.groupby('mode')[columns].apply(
lambda group:
pd.DataFrame(
(group.value_counts(col, True) for col in columns),
index=columns
).stack().reindex(multicol)
).fillna(0)
Here's the final output:
p.s. The notebook on Colab for experiments