Finding unique characters based on order from inputs of different lengths in DataFrame

Time:06-02

First of all, I checked Q1, Q2, and Q3 on Stack Overflow, but none of them relates directly to my question. I am using a dataframe, and order matters for me.

I'm creating a dummy dataframe to better explain my question.

import pandas as pd

data = {'seq': ['YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR','TLGTGSFGRVMLVHYAMKILDKVLQIEHTLNEKLVKLMVMEYVPGGEMFYDKPENLLIQVTDFGFAGFDY',
                'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL']}

dummy_df = pd.DataFrame(data)

My goal is to find the unique characters of each string, in the order they first appear, and then save them in a dataframe.

Expected output:

['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
['T', 'L', 'G', 'S', 'F', 'R', 'V', 'M', 'H', 'Y', 'A', 'K', 'I', 'D', 'Q', 'E', 'N', 'P']
['E', 'K', 'I', 'G', 'T', 'Y', 'V', 'A', 'M', 'S', 'L', 'Q', 'F', 'D', 'H', 'P', 'N']

I have tried two solutions:

1)

unique_char = []
# Walk the first row and keep each character the first time it appears
for c in dummy_df['seq'][0]:
    if c not in unique_char:
        unique_char.append(c)
print(unique_char)

It works, but I normally have more than 1000 rows, and I get an error when I set up a for loop that runs this separately for each row. I also need to convert the result into a dataframe.

2)
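One way to run solution 1 on every row without writing a separate loop per row is to wrap it in a function and call it once per sequence. The following is only a sketch: the name `unique_in_order` is made up for illustration, and a plain list of strings stands in for the `seq` column.

```python
def unique_in_order(seq):
    """Return the distinct characters of seq in order of first appearance."""
    unique_char = []
    for c in seq:
        if c not in unique_char:
            unique_char.append(c)
    return unique_char


# Stand-in for dummy_df['seq'] (assumption: each row is a plain string)
seqs = [
    'YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR',
    'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL',
]

# One result list per row, in the same order as the input rows
results = [unique_in_order(s) for s in seqs]
for r in results:
    print(r)
```

With a real dataframe, the same function could be passed to `dummy_df['seq'].apply(unique_in_order)`.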

from collections import defaultdict
dictionary = defaultdict(set)

# One set of characters per row index
for i in range(len(dummy_df['seq'])):
    for c in dummy_df['seq'][i]:
        if c not in dictionary[i]:
            dictionary[i].add(c)

It works for all rows, but the result is not in order, since a set does not keep insertion order.

For example, the output of solution 2 for dictionary[0] (the first row) is:

{'A',
 'C',
 'E',
 'F',
 'H',
 'I',
 'L',
 'N',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'V',
 'W',
 'Y'}

It should be

['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
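Solution 2 loses the order because it stores the characters in a set. A possible variation, sketched here under the assumption that each row is a plain string, keeps the set only for fast membership checks and appends to a list to record the order. The names `ordered` and `seen` are illustrative and not from the original code.

```python
from collections import defaultdict

# Stand-in for dummy_df['seq'] (assumption: each row is a plain string)
seqs = [
    'YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR',
    'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL',
]

ordered = defaultdict(list)  # row index -> characters in first-seen order

for i, seq in enumerate(seqs):
    seen = set()  # used only to test membership quickly
    for c in seq:
        if c not in seen:
            seen.add(c)
            ordered[i].append(c)  # the list preserves the order

print(ordered[0])
```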

CodePudding user response:

You can find the unique items in order with a dictionary, because dict keys are unique and keep their insertion order (Python 3.7+).

for i in data['seq']:
    print(list(dict.fromkeys(i)))
    
['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
['T', 'L', 'G', 'S', 'F', 'R', 'V', 'M', 'H', 'Y', 'A', 'K', 'I', 'D', 'Q', 'E', 'N', 'P']
['E', 'K', 'I', 'G', 'T', 'Y', 'V', 'A', 'M', 'S', 'L', 'Q', 'F', 'D', 'H', 'P', 'N']

You can then use a list comprehension to build the values to add to the dataframe.

new_data = [list(dict.fromkeys(i)) for i in data['seq']]
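To attach the result to the dataframe, something along these lines should work. This is a sketch, not the only way to do it: the column names `unique_chars` and `unique_str` are assumptions chosen for illustration.

```python
import pandas as pd

data = {'seq': ['YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR',
                'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL']}
dummy_df = pd.DataFrame(data)

# One list of ordered unique characters per row (hypothetical column name)
dummy_df['unique_chars'] = dummy_df['seq'].apply(lambda s: list(dict.fromkeys(s)))

# Optional: the same result as a single string per row
dummy_df['unique_str'] = dummy_df['seq'].apply(lambda s: ''.join(dict.fromkeys(s)))

print(dummy_df[['unique_chars', 'unique_str']])
```

`Series.apply` calls the function once per row, so this also covers the case of 1000 or more rows without an explicit loop.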