Keep only characters in between two hyphens from a dataframe cell which has comma separated list

Time:03-09

I have a DataFrame, e.g.:

Name    Subset
Apple   -AI-,-BI-A,-XC-,ZX-
Bat     -po-,-IJ-,-IA-B

and want to convert it to:

Name    Subset
Apple   AI,BI,XC,ZX
Bat     po,IJ,IA

i.e., remove the first hyphen of each item, and remove the second hyphen together with any characters that follow it up to the next comma.
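For anyone who wants to reproduce this, the sample frame can be built as in the sketch below. The variable names `df` and `expected` are assumptions for illustration; the values are copied from the two tables above.

```python
import pandas as pd

# Sample data copied from the question; `df` is an assumed variable name.
df = pd.DataFrame({
    "Name": ["Apple", "Bat"],
    "Subset": ["-AI-,-BI-A,-XC-,ZX-", "-po-,-IJ-,-IA-B"],
})

# Desired values from the question, kept here to check a candidate solution.
expected = pd.Series(["AI,BI,XC,ZX", "po,IJ,IA"], name="Subset")

print(df)
```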

CodePudding user response:

IIUC, use extractall with the -([^-]+)- regex to get only the values between dashes. Then use GroupBy.agg with ','.join to concatenate the values per row:

df['Subset'] = (df['Subset'].str.extractall(r'-([^-]+)-')[0]
                .groupby(level=0).agg(','.join)
                )

output:

    Name    Subset
0  Apple  AI,BI,XC
1    Bat  po,IJ,IA

To handle the case where the left dash can be missing, use [-,]+([^-,]+)-:

df['Subset'] = (df['Subset'].str.extractall(r'[-,]+([^-,]+)-')[0]
                .groupby(level=0).agg(','.join))

output:

    Name       Subset
0  Apple  AI,BI,XC,ZX
1    Bat     po,IJ,IA
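To see why the groupby step is needed, here is a minimal self-contained sketch that prints the intermediate result of extractall. It assumes the sample data from the question and that the pattern is [-,]+([^-,]+)-; the names `matches` and `result` are only for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Name": ["Apple", "Bat"],
    "Subset": ["-AI-,-BI-A,-XC-,ZX-", "-po-,-IJ-,-IA-B"],
})

# extractall returns one row per match, indexed by
# (original row label, match number); column 0 is the capture group.
matches = df["Subset"].str.extractall(r"[-,]+([^-,]+)-")[0]
print(matches)

# Grouping on the first index level puts the matches of each
# original row back together as one comma-separated string.
result = matches.groupby(level=0).agg(",".join)
print(result.tolist())
```

Note that extractall only returns rows for cells with at least one match, so assigning the result back to df['Subset'] leaves NaN in any row where nothing matched.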

CodePudding user response:

Use Series.str.split, Series.explode and Series.groupby.agg:

In [2193]: df['Subset'] = df.Subset.str.split(',').explode().str.split('-').str[:2].apply(''.join).groupby(level=0).agg(','.join)

In [2194]: df
Out[2194]: 
    Name       Subset
0  Apple  AI,BI,XC,ZX
1    Bat     po,IJ,IA
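The one-liner is easier to follow when split into steps. This is a sketch on the sample data from the question; the intermediate names (`items`, `pieces`, `kept`, `result`) are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Name": ["Apple", "Bat"],
    "Subset": ["-AI-,-BI-A,-XC-,ZX-", "-po-,-IJ-,-IA-B"],
})

# 1. Split each cell on commas and give every item its own row;
#    explode repeats the original index label for each item.
items = df["Subset"].str.split(",").explode()

# 2. Split each item on hyphens, e.g. "-BI-A" -> ["", "BI", "A"].
pieces = items.str.split("-")

# 3. Keep the first two pieces and glue them together, which drops
#    everything from the second hyphen onward.
kept = pieces.str[:2].apply("".join)

# 4. Rebuild one comma-separated string per original row.
result = kept.groupby(level=0).agg(",".join)
print(result.tolist())
```

One caveat: keeping the first two pieces relies on each item having at most one hyphen before the wanted text. An item such as ZX-A (no leading hyphen, but trailing characters) would come out as ZXA, so the regex approach may be safer for data like that.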