I have a data frame where one of the columns lists the gene my genetic mutations are associated with (last column).
0 chr1 6667742 T TTC HIGH frameshift_variant DNAJC11
1 chr1 8360467 G GC HIGH frameshift_variant RERE
2 chr1 10658519 T A MODERATE missense_variant CASZ1
3 chr1 12892965 T G MODERATE missense_variant PRAMEF10
4 chr1 14599118 AGCG A MODERATE conservative_inframe_deletion KAZN
.. ... ... ... ... ... ... ...
443 chrX 131273813 G C MODERATE missense_variant IGSF1
444 chrX 141003622 A G MODERATE missense_variant SPANXB1
445 chrX 152919025 CGAGGAG C MODERATE disruptive_inframe_deletion ZNF185
446 chrX 152919025 CGAGGAG C MODERATE sequence_feature ZNF185
447 chrY 12722134 CA C HIGH frameshift_variant USP9Y
I also have a list of genes that I want to see if my data frame contains. I have been able to compare my list of genes to my data frame and print the genes that matched. However, what I am trying to do now is have the script print out the entire row where a match occurs so that I have all the information associated with that match.
I isolated the column containing the gene associated with each genetic mutation using:
gene_column=data_frame.iloc[:,6]
And compared that to the list of genes I am interested in, which I inputted from a txt file.
genes_of_interest_txt = open(r'E:\bcf_analysis\gene_list\met_associated_genes_new_line.txt', "r")  # opens my list of genes, written one per line
genes_of_interest = genes_of_interest_txt.read()  # reads the text file
genes_of_interest_list = genes_of_interest.split("\n")  # turns the text into a list
I then found all the matches using these nested for loops.
for i in genes_of_interest_list:
    for num in gene_column:
        if num == i:
            print(num)  # print each gene that matched
Now I am trying to figure out how to print the whole row associated with each match. I am trying to build a flagging system that flags the rows where there is a match, then selects all flagged rows and outputs them into a new .csv file.
length_of_dataframe = 449
match_flag = np.zeros((length_of_dataframe, 1), dtype=int, order='C')
num = int(0)
for i in genes_of_interest_list:
    for num in gene_column:
        if num == i:
            match_flag[num] = 1
print(match_flag)
I am getting the following error.
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I am a total nooby at coding, so if you have a better method please let me know.
NOTE: I am using the numpy and pandas libraries.
CodePudding user response:
If I'm not mistaken, you just want to save the rows of the data frame that match your genes to a CSV file? In that case, you can first use a list comprehension to collect the matched genes, then use that list to look up the rows in your data frame.
matched_list = [num for i in genes_of_interest_list for num in gene_column if num == i]
# `num` and `i` are equal whenever the condition holds, so either one can be collected here
new_df = data_frame[data_frame['Your last column name'].isin(matched_list)]
new_df.to_csv("some_file_name.csv", index=False)
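As a side note, the IndexError in the question comes from `match_flag[num]`: `num` holds a gene name (a string) at that point, and a NumPy array cannot be indexed by a string. A minimal sketch of the same "flag, then select" idea without loops is below. It assumes the gene is in the last column; the column names and the small sample frame are made up for illustration, and the trailing empty string stands in for what `split("\n")` may produce if the txt file ends with a newline.

```python
import pandas as pd

# Hypothetical stand-in for the real data frame; column names are assumptions.
data_frame = pd.DataFrame(
    [
        ["chr1", 6667742, "T", "TTC", "HIGH", "frameshift_variant", "DNAJC11"],
        ["chr1", 14599118, "AGCG", "A", "MODERATE", "conservative_inframe_deletion", "KAZN"],
        ["chrX", 152919025, "CGAGGAG", "C", "MODERATE", "disruptive_inframe_deletion", "ZNF185"],
        ["chrX", 152919025, "CGAGGAG", "C", "MODERATE", "sequence_feature", "ZNF185"],
        ["chrY", 12722134, "CA", "C", "HIGH", "frameshift_variant", "USP9Y"],
    ],
    columns=["chrom", "pos", "ref", "alt", "impact", "effect", "gene"],
)

# Example gene list; the "" mimics a possible trailing newline in the txt file.
genes_of_interest_list = ["KAZN", "ZNF185", ""]

# Boolean flag per row: True where the last column holds a gene of interest.
match_flag = data_frame.iloc[:, -1].isin(genes_of_interest_list)

# Select every flagged row with all of its columns.
matched_rows = data_frame[match_flag]
print(matched_rows)

# With no path, to_csv returns the CSV text; pass a file name to write a file instead.
csv_text = matched_rows.to_csv(index=False)
```

Here `match_flag` plays the role of the flag array from the question, but it is indexed by row rather than by gene name, so it can be used directly to select rows.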
CodePudding user response:
Not sure I follow your request, but do you mean something like this? Note that this approach means converting your data frame into text, which may not be an option.
Code:
text = '''0 chr1 6667742 T TTC HIGH frameshift_variant DNAJC11
1 chr1 8360467 G GC HIGH frameshift_variant RERE
2 chr1 10658519 T A MODERATE missense_variant CASZ1
3 chr1 12892965 T G MODERATE missense_variant PRAMEF10
4 chr1 14599118 AGCG A MODERATE conservative_inframe_deletion KAZN
.. ... ... ... ... ... ... ...
443 chrX 131273813 G C MODERATE missense_variant IGSF1
444 chrX 141003622 A G MODERATE missense_variant SPANXB1
445 chrX 152919025 CGAGGAG C MODERATE disruptive_inframe_deletion ZNF185
446 chrX 152919025 CGAGGAG C MODERATE sequence_feature ZNF185
447 chrY 12722134 CA C HIGH frameshift_variant USP9Y'''
split_text = text.split('\n')  # split by rows
print('first example:')
for line in split_text:
    if "KAZN" in line:
        print(line)
print('\n')
check_this = ['KAZN', 'ZNF185']
print('second example:')
for line in split_text:
    if any(x in line for x in check_this):
        print(line)
Output:
first example:
4 chr1 14599118 AGCG A MODERATE conservative_inframe_deletion KAZN
second example:
4 chr1 14599118 AGCG A MODERATE conservative_inframe_deletion KAZN
445 chrX 152919025 CGAGGAG C MODERATE disruptive_inframe_deletion ZNF185
446 chrX 152919025 CGAGGAG C MODERATE sequence_feature ZNF185
The first example looks for a single phrase in each line and prints the line if it finds a match.
The second example looks for matches from a list of phrases.
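One caveat with `x in line` is that it is a substring test, so a gene name that happens to be contained in a longer name would also match. A possible variation, sketched below, compares only the last whitespace-separated field of each line against the list. The `KAZNX` row is invented purely to show the difference and is not from the original data.

```python
# Sample rows as text; the KAZNX line is a made-up row for illustration only.
text = """4 chr1 14599118 AGCG A MODERATE conservative_inframe_deletion KAZN
5 chr1 100 A G MODERATE missense_variant KAZNX
445 chrX 152919025 CGAGGAG C MODERATE disruptive_inframe_deletion ZNF185"""

check_this = {"KAZN", "ZNF185"}  # a set gives fast exact-membership checks

matched_lines = []
for line in text.split("\n"):
    fields = line.split()  # split the row on whitespace
    # Keep the row only if its last field is exactly one of the genes.
    if fields and fields[-1] in check_this:
        matched_lines.append(line)

for line in matched_lines:
    print(line)
```

With the substring test, the `KAZNX` row would have been printed as well; comparing the last field keeps only exact gene matches.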