If "index" column value is separated by ";", then retain only the substring after the ";". Else, retain as-is. Would be even better if its in list comprehension.
My code raised ValueError: Length of values (4402) does not match length of index (22501)
.
# If gene name is separated by ";", then get the substring after the ";"
list = []
for i in meth["Name"]:
if ";" in i:
list.append(i.split(";",1)[1])
else:
continue
meth["Name"] = list
Traceback:
--> 532 "Length of values "
533 f"({len(data)}) "
534 "does not match length of index "
ValueError: Length of values (4402) does not match length of index (22501)
Sample data:
meth.iloc[0:4,0:4].to_dict()
{'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297,
1: 0.786837244239289,
2: 0.5310546143038515,
3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566,
1: 0.386177267500376,
2: 0.5086236274690276,
3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055,
1: 0.54983504208745,
2: 0.5352071929258406,
3: 0.6139719037555759}}
CodePudding user response:
Are you trying to perform this operation on the column names, or to values of a specific column?
Either way, I think this will do the job:
import pandas as pd
# Define the example dataframe
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
# Replacing column names, using '-' as separator:
df.columns = df.columns.astype(str).str.split('-').str[-1]
# Modified dataframe:
df
"""
index 01A 01A 01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
You can apply the same logic to your dataframe index, or specific columns:
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759},
'name': {0: 'A;B;C', 1: 'AAA', 2: 'BBB', 3: 'C-DE'}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 A;B;C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
# Splitting values from column "name":
df['name'] = df['name'].astype(str).str.split(';').str[-1]
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
Note
Please note that if the column values hold multiple repetitions of the same separator (e.g.: "A;B;C"
), only the last substring gets returned (for "A;B;C"
, returns "C"
).