I am attempting to alter values in multiple columns based on corresponding values in other columns. I have been able to do this by hard coding, but I would appreciate any help in automating the following code so it can be replicated for any number of samples. Below, I share a minimal example input, ideal output and the working code. Note - I am still a bit green in python so comments go a long way.
Input-
01_s_IDX_type 01_s_IDX 02_s_IDY_type 02_s_IDY
HET 0/1:10,9:19:99:202,0,244 HET 0/1:18,1:19:99:202,0,244
HOM 0/1:20,0:20:99:202,0,244 HOM 0/1:50,0:50:99:202,0,244
Here, values from the IDX column are used to re-value the IDX_type columns. The information of interest are the 3rd and 4th integers in the IDX columns: 10,9 and 18,1. For sample 01 the ratio between 10:9 is between 0.7-1.3 so its type can stay as HET. For sample 02 the ratio between 18:1 is not between 0.7-1.3 so it's type is changed to REF.
Output-
01_s_IDX_type 01_s_IDX 02_s_IDY_type 02_s_IDY
HET 0/1:10,9:19:99:202,0,244 REF 0/1:18,1:19:99:202,0,244
HOM 0/1:20,0:20:99:202,0,244 HOM 0/1:50,0:50:99:202,0,244
Here is the code that achieved this.
#Create toy example
df = {'01_s_IDX_type': ['HET', 'HOM'],
'01_s_IDX': ['0/1:10,9:19:99:202,0,244', '0/1:20,0:20:99:202,0,244'],
'02_s_IDX_type': ['REF', 'HOM'],
'02_s_IDX': ['0/1:18,1:19:99:202,0,244', '0/1:0,50:50:99:202,0,244']
}
df = pd.DataFrame(df)
print (df)
#create new dfs for each sample
df_01, df_02 = df.filter(regex=r'^01'), df.filter(regex=r'^02')
#make copy of the info column
df_01_copy = df_01['01_s_IDX']
df_02_copy = df_02['02_s_IDX']
#remove unneeded parts of the column (first four characters)
df_01_copy = df_01_copy.str[4:]
df_02_copy = df_02_copy.str[4:]
#replace all commas with colons
df_01_copy = df_01_copy.replace(to_replace =',', value = ':', regex = True)
df_02_copy = df_02_copy.replace(to_replace =',', value = ':', regex = True)
#split into new columns by :
df_01_copy = df_01_copy.str.split(pat=':',expand=True)
df_02_copy = df_02_copy.str.split(pat=':',expand=True)
#keep first two columns
df_01_copy = df_01_copy.iloc[:,:2]
df_02_copy = df_02_copy.iloc[:,:2]
#rename columns
df_01_copy.columns = ['DP1', 'DP2']
df_02_copy.columns = ['DP1', 'DP2']
#convert to numeric, calculate ratios and add the ratios to OG dfs
df_01_copy = df_01_copy.apply(pd.to_numeric)
df_01['ratio'] = df_01_copy.DP1.div(df_01_copy.DP2)
df_02_copy = df_02_copy.apply(pd.to_numeric)
df_02['ratio'] = df_02_copy.DP1.div(df_02_copy.DP2)
#Keep HET if ratio is between 1.3-0.7, if ratio = 0 then HOM
df_01.loc[(df_01['ratio'] > 1.3), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] < 0.7), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] == 0), '01_s_IDX_type'] = 'HOM'
df_02.loc[(df_02['ratio'] > 1.3), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] < 0.7), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] == 0 ), '02_s_IDX_type'] = 'HOM'
#Rejoin
df_het = pd.concat([df_01, df_02, axis=1, join="outer")
df_out = df_het.drop('ratio', axis=1)
I have datasets which may consist of n samples, so turning this code into a pipeline/ function would be ideal. Thanks in advance for any help on this.
CodePudding user response:
Will your samples continue to repeat in new columns or will they be appended to subsequent rows? For example, if you have two more samples should we expect they will occupy new columns such as 03_s_IDX_type
and 03_s_IDX
, 04_s_IDX_type
and 04_s_IDX
, and so on? Or would these new samples appear as a new row underneath your existing row?
Assuming the former (that is, that each new sample is associated with two new columns), here's a solution I came up with. I know you said you wanted a loop, but this actually avoids looping and therefore should be faster and more efficient.
First, let's build the dataframe using the values you supplied.
df = {'01_s_IDX_type': ['HET'],
'01_s_IDX': ['0/1:10,9:19:99:202,0,244'],
'02_s_IDX_type': ['REF'],
'02_s_IDX': ['0/1:18,1:19:99:202,0,244']
}
ratio_bounds = (0.7, 1.3)
df = pd.DataFrame(df)
Next, if we can assume that each new sample abides by the same naming conventions, we can extract each column that contains similar data. This means that we can create a single dataframe that contains a column for the chromosomal data and use this to evaluate the genotype later on. We can also split the names of these to extract the sample ID number.
chro_vals = df[df.columns[1::2]]
idxs = [i.split('_')[0] for i in chro_vals]
Now, we're going to build a new dataframe from this simplified and restructured data. Since we don't need the old dataframe anymore, I'm just going to overwrite it by making a new dataframe with the same variable name. Also, using some of the tricks that you used in your first iteration, we can extract the experimental values that you are evaluating. I'll put them in two columns called Val1
and Val2
.
df = pd.DataFrame(columns=['Raw_values'], index=idxs, data=chro_vals.values.T)
df['Ratio_vals'] = df.Raw_values.str.split(pat=':', expand=True)[1]
df[['Val1', 'Val2']] = df.Ratio_vals.str.split(pat=',', expand=True).astype(int)
Now we can compute the ratio that you're interested in by dividing these columns. Once that's done, we can define a simple function that will be passed to df.apply()
to label the genotype of each of your samples.
def get_genotype(row):
if row['Observed_Ratio'] == 0:
return 'HOM'
elif row['Observed_Ratio'] > ratio_bounds[0] and row['Observed_Ratio'] < ratio_bounds[1]:
return 'REF'
else:
return 'HET'
df['Genotype'] = df.apply(lambda row: get_genotype(row), axis=1)
print(df)
If you want, you could subset the dataframe here to make it look a bit cleaner, but other than that you should have everything you need:
Raw_values Ratio_vals Val1 Val2 Observed_Ratio Genotype
01 0/1:10,9:19:99:202,0,244 10,9 10 9 1.111111 REF
02 0/1:18,1:19:99:202,0,244 18,1 18 1 18.000000 HET
Here is an updated answer to generalize to multiple rows. This problem may be trickier but I think I found a solution. It involves first concatenating all of your raw IDX values to a single column in a dataframe, then after your genotypes are evaluated, the dataframe is transformed into more or less its original shape.
First, similar definitions as above, this time incorporating the updated dataframe values.
import pandas as pd
df = {'01_s_IDX_type': ['HET', 'HOM'],
'01_s_IDX': ['0/1:10,9:19:99:202,0,244', '0/1:20,0:20:99:202,0,244'],
'02_s_IDX_type': ['REF', 'HOM'],
'02_s_IDX': ['0/1:18,1:19:99:202,0,244', '0/1:0,50:50:99:202,0,244']
}
ratio_bounds = (0.7, 1.3)
df = pd.DataFrame(df)
Here I think it's useful to label each row. I chose to call them "runs" but you might have some other more informative name. Then I separated the two columns associated with sample 2 to a separate dataframe and renamed those columns to match sample 1 - this will allow us to concatenate the dataframes with no issues.
df.index = [f'Run{i 1}' for i in range(df.shape[0])]
df_renamed = df[['02_s_IDX', '02_s_IDX_type']].rename(columns={'02_s_IDX': '01_s_IDX',
'02_s_IDX_type': '01_s_IDX_type'})
df = pd.concat([df[['01_s_IDX', '01_s_IDX_type']], df_renamed], axis=0).loc[:, ['01_s_IDX']]
By now we've created a dataframe with a single column representing the IDX values, which we will use to evaluate the genotypes in the same way as before.
df.columns = ['IDX']
df['Ratio_vals'] = df.IDX.str.split(pat=':', expand=True)[1]
df[['Val1', 'Val2']] = df.Ratio_vals.str.split(pat=',', expand=True).astype(int)
df['Observed_Ratio'] = df['Val1'] / df['Val2']
def get_genotype(row):
if row['Observed_Ratio'] == 0:
return 'HOM'
elif row['Observed_Ratio'] > ratio_bounds[0] and row['Observed_Ratio'] < ratio_bounds[1]:
return 'REF'
else:
return 'HET'
df['Genotype'] = df.apply(lambda row: get_genotype(row), axis=1)
Finally, we can just merge these dataframes together using the index values (i.e. the "runs" that we defined above) as a basis to merge on. Since the columns are identical, we can pass suffixes for the left and right dataframes.
df = df.iloc[:int(df.shape[0] / 2), :].merge(df.iloc[int(df.shape[0] / 2):, :], left_index=True,
right_index=True, suffixes=('_01', '_02'))
print(df)
If you inspect this output you will see you end up with a dataframe of shape (2, 12). From here you can rename the columns as you wish or subset the dataframe to clean it up further. Hope this helps!