Home > Software engineering >  How to check that rows start with or contain same value when grouped by other column in pandas
How to check that rows start with or contain same value when grouped by other column in pandas

Time:09-09

I have a dataframe that I'd like to group by id and the common 'top level' string with the count of 'lower-level' strings it includes. For example,

    id name
    1  AA-BB-CC-DD
    1  AA-BB-CC
    1  AA-BB-CC-DD-EE
    1  AA-BB-UU-VV
    1  AA-BB-UU
    1  FF-MM-NN
    1  FF-MM-NN-PP
    2  XX-YY
    2  XX-ZZ
    2  XX-ZZ-AA

for id 1 the first top level of column name is AA, second BB, third CC, etc. The common 'highest' level for id 1 in the data frame is AA-BB-CC.

The desired output is

    id  name      count
    1   AA-BB-CC  3
    1   AA-BB-UU  2
    1   FF-MM-NN  2
    2   XX-YY     1
    2   XX-ZZ     2

Thank you.

CodePudding user response:

IIUC, you can form groups using the first level, then groupby.agg using a custom function:

group = df['name'].str.extract('^([^-] )', expand=False)

def min_string(s):
    return min(s, key=lambda x: x.count('-'))

out = (df
 .groupby(['id', group], as_index=False)
 .agg(name=('name', min_string),
      count=('name', 'count')
     )
)

output:

   id      name  count
0   1  AA-BB-CC      3
1   1  FF-MM-NN      2
2   2     XX-YY      3

update: handle minimal set of first level(s)

change min_string to:

def min_string(s):
    return '-'.join(x[0] for x in zip(*s.str.split('-'))
                    if len(set(x)) == 1)

example input:

   id            name
0   1     AA-BB-CC-DD
1   1        AA-BB-CC
2   1  AA-BB-CC-DD-EE
3   1     AA-BB-UU-VV
4   1        AA-BB-UU
5   1        FF-MM-NN
6   1     FF-MM-NN-PP
7   2           XX-YY
8   2           XX-ZZ
9   2        XX-ZZ-AA

output:

   id      name  count
0   1     AA-BB      5
1   1  FF-MM-NN      2
2   2        XX      3

update2: maximal common subgroup

def make_groups(s, sep='-'):
    d = {}
    s = s.str.split(sep).sort_values()
    prev = s.iloc[0]
    for idx, val in s.items():
        if val[:len(prev)] != prev:
            prev = val
        d[idx] = sep.join(prev)
    return pd.Series(d, index=s.index)

group = df.groupby('id', group_keys=False)['name'].apply(make_groups)

out = (df
 .groupby(['id', group], as_index=False)
 .agg(name=('name', min_string),
      count=('name', 'count')
     )
)

output:

   id      name  count
0   1  AA-BB-CC      3
1   1  AA-BB-UU      2
2   1  FF-MM-NN      2
3   2     XX-YY      1
4   2     XX-ZZ      2

used input:

   id            name
0   1     AA-BB-CC-DD
1   1        AA-BB-CC
2   1  AA-BB-CC-DD-EE
3   1     AA-BB-UU-VV
4   1        AA-BB-UU
5   1        FF-MM-NN
6   1     FF-MM-NN-PP
7   2           XX-YY
8   2           XX-ZZ
9   2        XX-ZZ-AA
  • Related