How to group a certain part of a heading in a dataset using pandas?-CodePudding

I have a dataset here in which I have to group each sample. The group of each sample is apart of the sample's name. I have the rows each with a semi unique heading. e.g

TCGA.02.0047.GBM.C4, TCGA.02.0055.GBM.C4, TCGA.ZS.A9CG.LIHC.C3, TCGA.ZU.A8S4.CHOL.C1, TCGA.ZX.AA5X.CESC.C2.

I need to target the C bit in the heading and group the values in that heading so that each sample will be in either, C1, C2, C3 or C4.

How would I go about doing this?

I haven't tried anything yet as I can't find anything relating to it.

CodePudding user response：

For example, you have dataset like this:

import pandas as pd

df = pd.DataFrame({"Column_A": ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1", "TCGA.ZX.AA5X.CESC.C2"]})


    Column_A
0   TCGA.02.0047.GBM.C4
1   TCGA.02.0055.GBM.C4
2   TCGA.ZS.A9CG.LIHC.C3
3   TCGA.ZU.A8S4.CHOL.C1
4   TCGA.ZX.AA5X.CESC.C2

You can add new column with group:

df["Group"] = df["Column_A"].str[-2:]


    Column_A              Group
0   TCGA.02.0047.GBM.C4     C4
1   TCGA.02.0055.GBM.C4     C4
2   TCGA.ZS.A9CG.LIHC.C3    C3
3   TCGA.ZU.A8S4.CHOL.C1    C1
4   TCGA.ZX.AA5X.CESC.C2    C2

CodePudding user response：

if you have column names

You can extract the part after the last period and use it as grouper:

df.groupby(df.columns.str.extract('([^.] )$', expand=False), axis=1)

Then perform the desired aggregation.

if you have one column

df['new'] = df['col'].str.extract('([^.] )$')

Output:

                    col new
0   TCGA.02.0047.GBM.C4  C4
1   TCGA.02.0055.GBM.C4  C4
2  TCGA.ZS.A9CG.LIHC.C3  C3
3  TCGA.ZU.A8S4.CHOL.C1  C1
4  TCGA.ZX.AA5X.CESC.C2  C2