I have a dataset here in which I have to group each sample. The group of each sample is apart of the sample's name. I have the rows each with a semi unique heading. e.g
TCGA.02.0047.GBM.C4, TCGA.02.0055.GBM.C4, TCGA.ZS.A9CG.LIHC.C3, TCGA.ZU.A8S4.CHOL.C1, TCGA.ZX.AA5X.CESC.C2.
I need to target the C bit in the heading and group the values in that heading so that each sample will be in either, C1, C2, C3 or C4.
How would I go about doing this?
I haven't tried anything yet as I can't find anything relating to it.
CodePudding user response:
For example, you have dataset like this:
import pandas as pd
df = pd.DataFrame({"Column_A": ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1", "TCGA.ZX.AA5X.CESC.C2"]})
Column_A
0 TCGA.02.0047.GBM.C4
1 TCGA.02.0055.GBM.C4
2 TCGA.ZS.A9CG.LIHC.C3
3 TCGA.ZU.A8S4.CHOL.C1
4 TCGA.ZX.AA5X.CESC.C2
You can add new column with group:
df["Group"] = df["Column_A"].str[-2:]
Column_A Group
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
CodePudding user response:
if you have column names
You can extract
the part after the last period and use it as grouper:
df.groupby(df.columns.str.extract('([^.] )$', expand=False), axis=1)
Then perform the desired aggregation.
if you have one column
df['new'] = df['col'].str.extract('([^.] )$')
Output:
col new
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2