I am trying to match Sample IDs to a list of tissue_names. One Sample ID could have more than one tissue. Hence, I initially create an empty list and want to add the tissue names to the tissue_name column below.
TCGA_luad['tissue_name'] = 'NA'
for index, row in TCGA_luad.iterrows():
    for item in TCGA_lung_tissue_names:
        if row['Sample ID'] in item:
            if row['tissue_name'] == 'NA':
                TCGA_luad.at[index, 'tissue_name'] = []
                TCGA_luad.at[index, 'tissue_name'].append(item)
            else:
                print('here')
                TCGA_luad.at[index, 'tissue_name'].append(item)
Although many Sample IDs have more than one tissue name, execution never reaches the else branch, and 'here' is never printed.
On top of that, the tissue names don't get appended, and every item ends up as []. Do you know why the append doesn't work?
/tmp/ipykernel_2331339/2964965853.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
TCGA_luad['tissue_name'] = 'NA'
I end up with the entire tissue_name column being [].
len(TCGA_lung_tissue_names)
3206
TCGA_lung_tissue_names[:3]
['TCGA-05-4244-01A-01-BS1',
'TCGA-05-4244-01A-01-TS1',
'TCGA-05-4244-01Z-00-DX1']
CodePudding user response:
I think a simple apply statement will be easier to understand, shorter, and possibly more performant:
df['tissue_name'] = df['Sample ID'].apply(lambda sid: [item for item in TCGA_lung_tissue_names if sid in item] or 'NA')
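To make the one-liner above easier to try out, here is a minimal sketch on a tiny frame. The frame, the second Sample ID, and the helper name matching_tissues are made up for illustration; only the three tissue names come from the question.

```python
import pandas as pd

# The three tissue names shown in the question.
TCGA_lung_tissue_names = [
    "TCGA-05-4244-01A-01-BS1",
    "TCGA-05-4244-01A-01-TS1",
    "TCGA-05-4244-01Z-00-DX1",
]

# Hypothetical miniature frame; "TCGA-00-0000-01" is an invented ID with no match.
df = pd.DataFrame({"Sample ID": ["TCGA-05-4244-01", "TCGA-00-0000-01"]})


def matching_tissues(sid):
    """Return every tissue name that contains sid, or 'NA' if none does."""
    matches = [item for item in TCGA_lung_tissue_names if sid in item]
    return matches or "NA"  # an empty list is falsy, so 'NA' is used instead


# Build each cell's value in one pass instead of mutating cells inside iterrows().
df["tissue_name"] = df["Sample ID"].apply(matching_tissues)
result = df["tissue_name"].tolist()
```

Because the whole column is assigned at once, nothing depends on reading back a row while the frame is being modified. The SettingWithCopyWarning in the question suggests TCGA_luad was sliced from another frame; if that is the case, taking an explicit .copy() of the slice before adding the column is the usual way to avoid it.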
CodePudding user response:
Probably not the best solution, but the following worked:
TCGA_luad['tissue_name'] = 'NA'
duplicates = {}
for index, row in TCGA_luad.iterrows():
    duplicate_counter = 0
    # Collect the matches for this row in a plain list first
    duplicates[row['Patient ID']] = []
    for item in TCGA_lung_tissue_names:
        if row['Patient ID'] in item:
            if row['Sample ID'] in item:
                # Count every match so the check below can fire
                duplicate_counter += 1
                duplicates[row['Patient ID']].append(item)
    if row['tissue_name'] == 'NA':
        # Make the cell hold a list, then store the collected matches in it
        TCGA_luad.at[index, 'tissue_name'] = []
        TCGA_luad.at[index, 'tissue_name'] = duplicates[row['Patient ID']]
    if duplicate_counter > 1:
        print(duplicate_counter)
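The same bookkeeping (collect the matches per Sample ID, then see which IDs have more than one) can also be sketched without touching the DataFrame inside the loop. This is only an illustration under assumptions: group_tissues_by_sample is a hypothetical helper, and the second Sample ID is invented to show the no-match case.

```python
def group_tissues_by_sample(sample_ids, tissue_names):
    """Map each distinct sample ID to the tissue names that contain it."""
    return {
        sid: [name for name in tissue_names if sid in name]
        for sid in set(sample_ids)  # set() avoids recomputing repeated IDs
    }


# The three tissue names shown in the question.
tissue_names = [
    "TCGA-05-4244-01A-01-BS1",
    "TCGA-05-4244-01A-01-TS1",
    "TCGA-05-4244-01Z-00-DX1",
]
# Hypothetical IDs; the second one is made up and matches nothing.
sample_ids = ["TCGA-05-4244-01", "TCGA-00-0000-01"]

grouped = group_tissues_by_sample(sample_ids, tissue_names)

# Sample IDs with more than one matching tissue name, and how many they have.
multi = {sid: len(names) for sid, names in grouped.items() if len(names) > 1}
```

With the dictionary in hand, something like TCGA_luad['tissue_name'] = TCGA_luad['Sample ID'].map(grouped) should fill the column in one assignment; note that unmatched IDs get an empty list here rather than 'NA'.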