I have a dataframe
import pandas as pd
data_as_dict={'CHROM': {2241743: 21, 2241744: 3, 2241745: 5, 2241746: 6, 2241747: 6, 2241748: 11, 2241749: 16, 2241750: 20, 0: 'chr1', 1: 'chr2', 2: 'chr3', 3: 'chr5', 4: 'chr5', 5: 'chr8', 6: 'chr9', 7: 'chr10', 8: 'chr14', 9: 'chr14', 10: 'chr15', 11: 'chr15'}, 'POS_GRCh38': {2241743: 41668857, 2241744: 189487243, 2241745: 33946466, 2241746: 396321, 2241747: 32641081, 2241748: 89284793, 2241749: 89960104, 2241750: 34077942, 0: '233276815', 1: '217427435', 2: '169800667', 3: '1279675', 4: '112150207', 5: '32575278', 6: '97775520', 7: '103934543', 8: '36063370', 9: '36269155', 10: '67165147', 11: '67163292'}, 'REF': {2241743: 'T', 2241744: 'A', 2241745: 'A', 2241746: 'C', 2241747: 'C', 2241748: 'G', 2241749: 'T', 2241750: 'G', 0: 'G', 1: 'G', 2: 'G', 3: 'C', 4: 'T', 5: 'T', 6: 'C', 7: 'C', 8: 'C', 9: 'C', 10: 'G', 11: 'C'}, 'Effect_allele': {2241743: 'A', 2241744: 'T', 2241745: 'G', 2241746: 'T', 2241747: 'T', 2241748: 'A', 2241749: 'C', 2241750: 'A', 0: 'A', 1: 'C', 2: 'T', 3: 'T', 4: 'A', 5: 'G', 6: 'A', 7: 'T', 8: 'G', 9: 'T', 10: 'C', 11: 'T'}, 'Effect_size': {2241743: 0.094310679, 2241744: 0.122217633, 2241745: 0.527632769, 2241746: 0.482426149, 2241747: 0.157003749, 2241748: 0.148420005, 2241749: 0.285178942, 2241750: 0.2390169, 0: 0.2776317365982795, 1: 0.3576744442718159, 2: 0.2070141693843261, 3: 0.1823215567939546, 4: 0.3148107398400336, 5: 0.2776317365982795, 6: 0.5247285289349821, 7: 0.3435897043900768, 8: 0.3293037471426003, 9: 0.5933268452777344, 10: 0.2070141693843261, 11: 0.2151113796169455}, 'TYPE': {2241743: 'Basal_cell_carcinoma_PRSWeb', 2241744: 'Squamous_cell_carcinoma_PRSWeb', 2241745: 'Squamous_cell_carcinoma_PRSWeb', 2241746: 'Squamous_cell_carcinoma_PRSWeb', 2241747: 'Squamous_cell_carcinoma_PRSWeb', 2241748: 'Squamous_cell_carcinoma_PRSWeb', 2241749: 'Squamous_cell_carcinoma_PRSWeb', 2241750: 'Squamous_cell_carcinoma_PRSWeb', 0: 'THYROID_PGS', 1: 'THYROID_PGS', 2: 'THYROID_PGS', 3: 'THYROID_PGS', 4: 'THYROID_PGS', 5: 'THYROID_PGS', 6: 'THYROID_PGS', 7: 'THYROID_PGS', 8: 'THYROID_PGS', 9: 'THYROID_PGS', 10: 'THYROID_PGS', 11: 'THYROID_PGS'}, 'Cancer': {2241743: 'NMSC', 2241744: 'NMSC', 2241745: 'NMSC', 2241746: 'NMSC', 2241747: 'NMSC', 2241748: 'NMSC', 2241749: 'NMSC', 2241750: 'NMSC', 0: 'THYROID', 1: 'THYROID', 2: 'THYROID', 3: 'THYROID', 4: 'THYROID', 5: 'THYROID', 6: 'THYROID', 7: 'THYROID', 8: 'THYROID', 9: 'THYROID', 10: 'THYROID', 11: 'THYROID'}, 'Significant_YN': {2241743: 'Y', 2241744: 'Y', 2241745: 'Y', 2241746: 'Y', 2241747: 'Y', 2241748: 'Y', 2241749: 'Y', 2241750: 'Y', 0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y', 5: 'Y', 6: 'Y', 7: 'Y', 8: 'Y', 9: 'Y', 10: 'Y', 11: 'Y'}}
all_cancers = pd.DataFrame.from_dict(data_as_dict)
I want to append string chr
in column CHROM
where it's missing. I can do it in R with grepl
and paste
, but wanted to try in Python.
I came up with these two commands, but not sure how to index the column because pd.Series is generating NaNs.
pd.Series(all_cancers['CHROM']).str.contains(pat = 'chr', regex = True)
"chr" all_cancers['CHROM'].map(str)
CodePudding user response:
String operations in pandas are not optimized, so the best way to do what you want is via a list comprehension.
# check if a value contains 'chr' and prepend it if not
all_cancers['CHROM'] = [x if isinstance(x, str) and 'chr' in x else f"chr{x}" for x in all_cancers['CHROM'].tolist()]
An equivalent pandas operation could be via mask()
method.
# flag rows that starts with 'chr' and prepend 'chr' to values in the remaining rows
all_cancers['CHROM'] = all_cancers['CHROM'].mask(~all_cancers['CHROM'].str.startswith('chr', na=False), 'chr' all_cancers['CHROM'].astype(str))
A simple timeit
test would show that the list comp is much faster.
CodePudding user response:
See also:
- How to conditionally update DataFrame column in Pandas
- Replace all values that do not contain a specific string in pandas dataframe
For example using NumPy's conditional filter np.where
:
import pandas as pd
import numpy as np
data_as_dict = {
'CHROM': {2241743: 21, 2241744: 3, 2241745: 5, 2241746: 6, 2241747: 6, 2241748: 11, 2241749: 16, 2241750: 20,
0: 'chr1', 1: 'chr2', 2: 'chr3', 3: 'chr5', 4: 'chr5', 5: 'chr8', 6: 'chr9', 7: 'chr10', 8: 'chr14',
9: 'chr14', 10: 'chr15', 11: 'chr15'},
'POS_GRCh38': {2241743: 41668857, 2241744: 189487243, 2241745: 33946466, 2241746: 396321, 2241747: 32641081,
2241748: 89284793, 2241749: 89960104, 2241750: 34077942, 0: '233276815', 1: '217427435',
2: '169800667', 3: '1279675', 4: '112150207', 5: '32575278', 6: '97775520', 7: '103934543',
8: '36063370', 9: '36269155', 10: '67165147', 11: '67163292'},
'REF': {2241743: 'T', 2241744: 'A', 2241745: 'A', 2241746: 'C', 2241747: 'C', 2241748: 'G', 2241749: 'T',
2241750: 'G', 0: 'G', 1: 'G', 2: 'G', 3: 'C', 4: 'T', 5: 'T', 6: 'C', 7: 'C', 8: 'C', 9: 'C', 10: 'G',
11: 'C'},
'Effect_allele': {2241743: 'A', 2241744: 'T', 2241745: 'G', 2241746: 'T', 2241747: 'T', 2241748: 'A',
2241749: 'C', 2241750: 'A', 0: 'A', 1: 'C', 2: 'T', 3: 'T', 4: 'A', 5: 'G', 6: 'A', 7: 'T',
8: 'G', 9: 'T', 10: 'C', 11: 'T'},
'Effect_size': {2241743: 0.094310679, 2241744: 0.122217633, 2241745: 0.527632769, 2241746: 0.482426149,
2241747: 0.157003749, 2241748: 0.148420005, 2241749: 0.285178942, 2241750: 0.2390169,
0: 0.2776317365982795, 1: 0.3576744442718159, 2: 0.2070141693843261, 3: 0.1823215567939546,
4: 0.3148107398400336, 5: 0.2776317365982795, 6: 0.5247285289349821, 7: 0.3435897043900768,
8: 0.3293037471426003, 9: 0.5933268452777344, 10: 0.2070141693843261, 11: 0.2151113796169455},
'TYPE': {2241743: 'Basal_cell_carcinoma_PRSWeb', 2241744: 'Squamous_cell_carcinoma_PRSWeb',
2241745: 'Squamous_cell_carcinoma_PRSWeb', 2241746: 'Squamous_cell_carcinoma_PRSWeb',
2241747: 'Squamous_cell_carcinoma_PRSWeb', 2241748: 'Squamous_cell_carcinoma_PRSWeb',
2241749: 'Squamous_cell_carcinoma_PRSWeb', 2241750: 'Squamous_cell_carcinoma_PRSWeb', 0: 'THYROID_PGS',
1: 'THYROID_PGS', 2: 'THYROID_PGS', 3: 'THYROID_PGS', 4: 'THYROID_PGS', 5: 'THYROID_PGS',
6: 'THYROID_PGS', 7: 'THYROID_PGS', 8: 'THYROID_PGS', 9: 'THYROID_PGS', 10: 'THYROID_PGS',
11: 'THYROID_PGS'},
'Cancer': {2241743: 'NMSC', 2241744: 'NMSC', 2241745: 'NMSC', 2241746: 'NMSC', 2241747: 'NMSC', 2241748: 'NMSC',
2241749: 'NMSC', 2241750: 'NMSC', 0: 'THYROID', 1: 'THYROID', 2: 'THYROID', 3: 'THYROID',
4: 'THYROID', 5: 'THYROID', 6: 'THYROID', 7: 'THYROID', 8: 'THYROID', 9: 'THYROID', 10: 'THYROID',
11: 'THYROID'},
'Significant_YN': {2241743: 'Y', 2241744: 'Y', 2241745: 'Y', 2241746: 'Y', 2241747: 'Y', 2241748: 'Y',
2241749: 'Y', 2241750: 'Y', 0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y', 5: 'Y', 6: 'Y', 7: 'Y',
8: 'Y', 9: 'Y', 10: 'Y', 11: 'Y'}}
all_cancers = pd.DataFrame.from_dict(data_as_dict)
all_cancers['CHROM'] = np.where(~all_cancers['CHROM'].str.startswith("chr", na=False), "chr", all_cancers['CHROM'])