How to compare a list to a column in a data frame and print all rows where the list matches the colu

Time:05-24

I have a data frame where one of the columns lists the gene my genetic mutations are associated with (last column).

0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y

I also have a list of genes, and I want to check whether my data frame contains any of them. I have been able to compare my list of genes to my data frame and print the genes that matched. What I am trying to do now is have the script print the entire row wherever a match occurs, so that I have all the information associated with that match.

I isolated the column containing the gene associated with each genetic mutation using:

gene_column=data_frame.iloc[:,6]

I then compared that to the list of genes I am interested in, which I read in from a txt file.

with open(r'E:\bcf_analysis\gene_list\met_associated_genes_new_line.txt', "r") as genes_of_interest_txt: #opens my list of genes written as each item on a new line
    genes_of_interest = genes_of_interest_txt.read() #reads the text file
genes_of_interest_list = genes_of_interest.split("\n") #makes text file a list

I then found all the matches using these nested for loops.

for i in genes_of_interest_list:
    for num in gene_column:
        if num == i:
            print(num) #prints each gene that matched

Now I am trying to figure out how to print the whole row associated with each match. My idea is to build a flagging system that marks the rows where there is a match, then select all flagged rows and output them to a new .csv file.


length_of_dataframe = 449
match_flag = np.zeros((length_of_dataframe, 1), dtype=int, order='C')



num = int(0)


for i in genes_of_interest_list: 
    for num in gene_column:
        if num == i : 
            match_flag[num]= 1
            
print (match_flag)

I am getting the following error.

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I am a total nooby at coding, so if you have a better method please let me know.

NOTE: I am using the numpy and pandas libraries.

CodePudding user response:

If I'm not mistaken, you just want to save the rows for all the matched genes to a csv file? In that case, you can first use a list comprehension to obtain a list of matched genes, then use that list to filter your data frame.

matched_list = [num for i in genes_of_interest_list for num in gene_column if num == i]
# `num` and `i` are equal whenever the condition holds, so you can collect either one

new_df = data_frame[data_frame['Your last column name'].isin(matched_list)]

new_df.to_csv("some_file_name.csv", index=False)
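
As a side note, the IndexError in the question comes from `match_flag[num]`: `num` is a gene name (a string), and a NumPy array cannot be indexed by a string. The flag array is not really needed, because `isin` already returns a boolean flag per row. Below is a minimal sketch of that idea. The tiny data frame and its column names (`chrom`, `pos`, `effect`, `gene`) are made up for illustration, so substitute your own.

```python
import pandas as pd

# Hypothetical miniature of the table from the question; column names are assumptions.
data_frame = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chrX", "chrX"],
        "pos": [6667742, 14599118, 152919025, 152919025],
        "effect": [
            "frameshift_variant",
            "conservative_inframe_deletion",
            "disruptive_inframe_deletion",
            "sequence_feature",
        ],
        "gene": ["DNAJC11", "KAZN", "ZNF185", "ZNF185"],
    }
)

# The empty string mimics what a trailing blank line in the txt file would produce.
genes_of_interest_list = ["KAZN", "ZNF185", ""]

# Boolean Series with one entry per row: True where the gene is in the list.
match_flag = data_frame["gene"].isin(genes_of_interest_list)

# Boolean indexing keeps only the flagged rows, with all their columns.
matched_rows = data_frame[match_flag]
print(matched_rows)

# With no path, to_csv returns the CSV as a string; pass a file name to write a file.
csv_text = matched_rows.to_csv(index=False)
```

If you only have the column position rather than its name, `data_frame.iloc[:, 6].isin(genes_of_interest_list)` should give the same kind of mask.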

CodePudding user response:

I'm not sure I follow your request, but do you mean something like this? Note that this approach means converting your data frame into text, which may not be an option for you.

Code:

text = '''0    chr1    6667742        T  TTC          HIGH             frameshift_variant     DNAJC11
1    chr1    8360467        G   GC          HIGH             frameshift_variant        RERE
2    chr1   10658519        T    A      MODERATE               missense_variant       CASZ1
3    chr1   12892965        T    G      MODERATE               missense_variant    PRAMEF10
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
..    ...        ...      ...  ...           ...                            ...         ...
443  chrX  131273813        G    C      MODERATE               missense_variant       IGSF1
444  chrX  141003622        A    G      MODERATE               missense_variant     SPANXB1
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y'''

split_text = text.split('\n') #split by rows

print('first example:')
for line in split_text:
    if "KAZN" in line:
        print(line)

print('\n')     
check_this = ['KAZN', 'ZNF185']

print('second example:')
for line in split_text:
    if any(x in line for x in check_this):
        print(line)

Output:

first example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN

second example:
4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
446  chrX  152919025  CGAGGAG    C      MODERATE               sequence_feature      ZNF185

[Program finished]

The first example looks for a single phrase in each line and prints the line if it finds a match.

The second example looks for matches against a list of phrases.
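
One caveat with `x in line`: it is a substring test, so a short gene name would also match any line containing a longer name that includes it. A possible variation, sketched here on the assumption that the gene is always the last whitespace-separated field of each line, compares that field exactly. The `matched_lines` name and the shortened sample text are just for illustration.

```python
text = '''4    chr1   14599118     AGCG    A      MODERATE  conservative_inframe_deletion        KAZN
445  chrX  152919025  CGAGGAG    C      MODERATE    disruptive_inframe_deletion      ZNF185
447  chrY   12722134       CA    C          HIGH             frameshift_variant       USP9Y'''

# "ZNF18" is a substring of "ZNF185"; a substring test would wrongly match that row.
check_this = {'KAZN', 'ZNF18'}

matched_lines = []
for line in text.split('\n'):
    fields = line.split()  # split on any run of whitespace
    # Compare the last field (assumed to be the gene) exactly against the set.
    if fields and fields[-1] in check_this:
        matched_lines.append(line)

for line in matched_lines:
    print(line)
```

Using a set for `check_this` keeps each lookup fast even when the gene list is long.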
