I have a dataframe that I'd like to group by id and the common 'top level' string with the count of 'lower-level' strings it includes. For example,
id name
1 AA-BB-CC-DD
1 AA-BB-CC
1 AA-BB-CC-DD-EE
1 AA-BB-UU-VV
1 AA-BB-UU
1 FF-MM-NN
1 FF-MM-NN-PP
2 XX-YY
2 XX-ZZ
2 XX-ZZ-AA
for id 1 the first top level of column name is AA, second BB, third CC, etc. The common 'highest' level for id 1 in the data frame is AA-BB-CC.
The desired output is
id name count
1 AA-BB-CC 3
1 AA-BB-UU 2
1 FF-MM-NN 2
2 XX-YY 1
2 XX-ZZ 2
Thank you.
CodePudding user response:
IIUC, you can form groups using the first level, then groupby.agg
using a custom function:
group = df['name'].str.extract('^([^-] )', expand=False)
def min_string(s):
return min(s, key=lambda x: x.count('-'))
out = (df
.groupby(['id', group], as_index=False)
.agg(name=('name', min_string),
count=('name', 'count')
)
)
output:
id name count
0 1 AA-BB-CC 3
1 1 FF-MM-NN 2
2 2 XX-YY 3
update: handle minimal set of first level(s)
change min_string
to:
def min_string(s):
return '-'.join(x[0] for x in zip(*s.str.split('-'))
if len(set(x)) == 1)
example input:
id name
0 1 AA-BB-CC-DD
1 1 AA-BB-CC
2 1 AA-BB-CC-DD-EE
3 1 AA-BB-UU-VV
4 1 AA-BB-UU
5 1 FF-MM-NN
6 1 FF-MM-NN-PP
7 2 XX-YY
8 2 XX-ZZ
9 2 XX-ZZ-AA
output:
id name count
0 1 AA-BB 5
1 1 FF-MM-NN 2
2 2 XX 3
update2: maximal common subgroup
def make_groups(s, sep='-'):
d = {}
s = s.str.split(sep).sort_values()
prev = s.iloc[0]
for idx, val in s.items():
if val[:len(prev)] != prev:
prev = val
d[idx] = sep.join(prev)
return pd.Series(d, index=s.index)
group = df.groupby('id', group_keys=False)['name'].apply(make_groups)
out = (df
.groupby(['id', group], as_index=False)
.agg(name=('name', min_string),
count=('name', 'count')
)
)
output:
id name count
0 1 AA-BB-CC 3
1 1 AA-BB-UU 2
2 1 FF-MM-NN 2
3 2 XX-YY 1
4 2 XX-ZZ 2
used input:
id name
0 1 AA-BB-CC-DD
1 1 AA-BB-CC
2 1 AA-BB-CC-DD-EE
3 1 AA-BB-UU-VV
4 1 AA-BB-UU
5 1 FF-MM-NN
6 1 FF-MM-NN-PP
7 2 XX-YY
8 2 XX-ZZ
9 2 XX-ZZ-AA