Home > OS >  If column value contains symbol, retain only the substring after the symbol
If column value contains symbol, retain only the substring after the symbol

Time:08-16

If "index" column value is separated by ";", then retain only the substring after the ";". Else, retain as-is. Would be even better if its in list comprehension. My code raised ValueError: Length of values (4402) does not match length of index (22501).

# If gene name is separated by ";", then get the substring after the ";"
list = []
for i in meth["Name"]:
    if ";" in i:
        list.append(i.split(";",1)[1])
    else:
        continue

meth["Name"] = list

Traceback:

--> 532             "Length of values "
    533             f"({len(data)}) "
    534             "does not match length of index "

ValueError: Length of values (4402) does not match length of index (22501)

Sample data:

meth.iloc[0:4,0:4].to_dict()

{'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
 'TCGA-2K-A9WE-01A': {0: 0.27891582736223297,
  1: 0.786837244239289,
  2: 0.5310546143038515,
  3: 0.7119161837613309},
 'TCGA-2Z-A9J1-01A': {0: 0.318496987871566,
  1: 0.386177267500376,
  2: 0.5086236274690276,
  3: 0.4036012750884792},
 'TCGA-2Z-A9J2-01A': {0: 0.400119915667055,
  1: 0.54983504208745,
  2: 0.5352071929258406,
  3: 0.6139719037555759}}

CodePudding user response:

Are you trying to perform this operation on the column names, or to values of a specific column?

Either way, I think this will do the job:

import pandas as pd

# Define the example dataframe
df = pd.DataFrame(
    {
        'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
        'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
                             2: 0.5310546143038515, 3: 0.7119161837613309},
        'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
                             2: 0.5086236274690276, 3: 0.4036012750884792},
        'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
                             2: 0.5352071929258406, 3: 0.6139719037555759}
    }
)
# Original dataframe:
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A
0   A1BG          0.278916          0.318497          0.400120
1   A1CF          0.786837          0.386177          0.549835
2  A2BP1          0.531055          0.508624          0.535207
3  A2LD1          0.711916          0.403601          0.613972
"""
# Replacing column names, using '-' as separator:

df.columns = df.columns.astype(str).str.split('-').str[-1]

# Modified dataframe:
df
"""
   index       01A       01A       01A
0   A1BG  0.278916  0.318497  0.400120
1   A1CF  0.786837  0.386177  0.549835
2  A2BP1  0.531055  0.508624  0.535207
3  A2LD1  0.711916  0.403601  0.613972
"""

You can apply the same logic to your dataframe index, or specific columns:


df = pd.DataFrame(
    {
        'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
        'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
                             2: 0.5310546143038515, 3: 0.7119161837613309},
        'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
                             2: 0.5086236274690276, 3: 0.4036012750884792},
        'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
                             2: 0.5352071929258406, 3: 0.6139719037555759},
        'name': {0: 'A;B;C', 1: 'AAA', 2: 'BBB', 3: 'C-DE'}
    }
)

# Original dataframe:
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A   name
0   A1BG          0.278916          0.318497          0.400120  A;B;C
1   A1CF          0.786837          0.386177          0.549835    AAA
2  A2BP1          0.531055          0.508624          0.535207    BBB
3  A2LD1          0.711916          0.403601          0.613972   C-DE
"""

# Splitting values from column "name":

df['name'] = df['name'].astype(str).str.split(';').str[-1]
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A  name
0   A1BG          0.278916          0.318497          0.400120     C
1   A1CF          0.786837          0.386177          0.549835   AAA
2  A2BP1          0.531055          0.508624          0.535207   BBB
3  A2LD1          0.711916          0.403601          0.613972  C-DE
"""

Note

Please note that if the column values hold multiple repetitions of the same separator (e.g.: "A;B;C"), only the last substring gets returned (for "A;B;C", returns "C").

  • Related