How to compare a list to a column in a data frame and print all rows where the list matches the column

I have a data frame where one of the columns lists the gene my genetic mutations are associated with (last column).

0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y

I also have a list of genes, and I want to check which of them appear in my data frame. I have been able to compare my list of genes to my data frame and print the genes that matched. However, what I am trying to do now is have the script print out the entire row where a match occurs, so that I have all the information associated with that match.

I isolated the column containing the genes associated with each genetic mutation using:

gene_column=data_frame.iloc[:,6]

I then compared that to the list of genes I am interested in, which I read in from a txt file.

# Open the gene list (one gene per line); the with-block closes the file automatically
with open(r'E:\bcf_analysis\gene_list\met_associated_genes_new_line.txt', "r") as genes_of_interest_txt:
    genes_of_interest = genes_of_interest_txt.read()  # reads the text file
genes_of_interest_list = genes_of_interest.splitlines()  # turns the text into a list, one gene per item

I then found all the matches using these nested for loops.

for i in genes_of_interest_list:
    for num in gene_column:
        if num == i:
            print(num)  # prints each gene that matched

Now I am trying to figure out how to print the whole row associated with each match. I am trying to build a flagging system that flags the rows where there is a match, then selects all flagged rows and outputs them to a new .csv file.


length_of_dataframe = 449
match_flag = np.zeros((length_of_dataframe, 1), dtype=int, order='C')



num = int(0)


for i in genes_of_interest_list: 
    for num in gene_column:
        if num == i : 
            match_flag[num]= 1
            
print (match_flag)

I am getting the following error.

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I am a total nooby at coding, so if you have a better method please let me know.

NOTE: I am using the numpy and pandas libraries.

CodePudding user response：

If I'm not mistaken, you just want to save the rows for all the matched genes to a CSV file? In that case, you can first use a list comprehension to obtain a list of matched genes, then use that list to filter your dataframe.

matched_list = [num for i in genes_of_interest_list for num in gene_column if num == i]
# `num` and `i` are equal whenever the condition holds, so either one can be collected

new_df = data_frame[data_frame['Your last column name'].isin(matched_list)]

new_df.to_csv("some_file_name.csv", index=False)
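
As a possible shortcut, `isin` accepts the gene list directly, so the intermediate list comprehension may not be needed. The sketch below uses a small made-up data frame and assumed column names (`chrom`, `pos`, `gene`), since the question does not show the real headers. The boolean mask it builds plays the role of the `match_flag` array from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the column names are assumptions.
data_frame = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chrX", "chrX"],
    "pos": [6667742, 14599118, 152919025, 152919025],
    "gene": ["DNAJC11", "KAZN", "ZNF185", "ZNF185"],
})
genes_of_interest_list = ["KAZN", "ZNF185", ""]  # "" mimics a trailing blank line

# Drop blank entries and stray whitespace from the gene list.
genes = [g.strip() for g in genes_of_interest_list if g.strip()]

# Boolean mask: True for every row whose gene is in the list.
match_flag = data_frame["gene"].isin(genes)

# Boolean indexing keeps the whole rows, not just the gene names.
matched_rows = data_frame[match_flag]
print(matched_rows)

# Writing to CSV is left commented out; the file name is a placeholder.
# matched_rows.to_csv("matched_genes.csv", index=False)
```

The `IndexError` in the question comes from `match_flag[num]`, where `num` is a gene name (a string) rather than an integer position. A boolean mask avoids indexing the flag array by hand.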

CodePudding user response：

I'm not sure I follow your request, but do you mean something like this? Note that it requires converting your data frame to text, which may not be an option.

Code:

text = '''0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y'''

split_text = text.split('\n') #split by rows

print('first example:')
for line in split_text:
    if "KAZN" in line:
        print(line)

print('\n')     
check_this = ['KAZN', 'ZNF185']

print('second example:')
for line in split_text:
    if any(x in line for x in check_this):
        print(line)

Output:

first example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN

second example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185

The first example looks for a single phrase in each line and prints the line when it finds a match.

The second example looks for matches from a list of phrases.
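
One caveat with the `in` checks above: they match substrings anywhere in the line, so a short gene name can also match a longer one that contains it. A minimal sketch of a stricter variant, assuming the gene is always the last whitespace-separated field, is shown below; the `'USP9'` entry is made up purely to illustrate the difference.

```python
text = '''4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y'''

# 'USP9' is a hypothetical entry: a substring check would match the USP9Y row.
check_this = {'KAZN', 'USP9'}

matched_lines = []
for line in text.split('\n'):
    fields = line.split()  # split on runs of whitespace
    # Compare the last field (assumed to be the gene) exactly.
    if fields and fields[-1] in check_this:
        matched_lines.append(line)

for line in matched_lines:
    print(line)
```

Only the KAZN row is printed here, whereas `any(x in line for x in check_this)` would also print the USP9Y row.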