First of all, I checked Q1, Q2, and Q3 on Stack Overflow, but none of them directly addresses my question. I am working with a dataframe, and preserving order is important to me.
I'm creating a dummy dataframe to better explain my question.
import pandas as pd

data = {'seq': ['YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR',
                'TLGTGSFGRVMLVHYAMKILDKVLQIEHTLNEKLVKLMVMEYVPGGEMFYDKPENLLIQVTDFGFAGFDY',
                'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL']}
dummy_df = pd.DataFrame(data)
My goal is to find the unique characters of each string, in order of first appearance, and then store the results in a dataframe.
Expected output:
['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
['T', 'L', 'G', 'S', 'F', 'R', 'V', 'M', 'H', 'Y', 'A', 'K', 'I', 'D', 'Q', 'E', 'N', 'P']
['E', 'K', 'I', 'G', 'T', 'Y', 'V', 'A', 'M', 'S', 'L', 'Q', 'F', 'D', 'H', 'P', 'N']
Actually, I have two solutions:
1)
unique_char = []
for c in dummy_df['seq'][0]:
    # keep a character only the first time it is seen
    if c not in unique_char:
        unique_char.append(c)
print(unique_char)
It works, but my real data has more than 1000 rows, and I get an error when I try to wrap this in a loop that runs separately for each row. I also need to convert the result into a dataframe.
2)
from collections import defaultdict
dictionary = defaultdict(set)
for i in range(len(dummy_df['seq'])):
    for c in dummy_df['seq'][i]:
        # set.add ignores duplicates, so no membership check is needed
        dictionary[i].add(c)
It works for all rows, but the result is not in order.
For example, the output of Solution 2 for dictionary[0] (the first row) is:
{'A', 'C', 'E', 'F', 'H', 'I', 'L', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'}
It should be
['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
CodePudding user response:
You can find the unique items in order with a dictionary: dict.fromkeys drops duplicate keys, and since Python 3.7 a dict keeps its keys in insertion order.
for i in data['seq']:
    print(list(dict.fromkeys(i)))
['Y', 'S', 'P', 'N', 'I', 'Q', 'H', 'F', 'E', 'L', 'V', 'A', 'T', 'C', 'W', 'R']
['T', 'L', 'G', 'S', 'F', 'R', 'V', 'M', 'H', 'Y', 'A', 'K', 'I', 'D', 'Q', 'E', 'N', 'P']
['E', 'K', 'I', 'G', 'T', 'Y', 'V', 'A', 'M', 'S', 'L', 'Q', 'F', 'D', 'H', 'P', 'N']
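If it helps to see why this keeps the order while the set in Solution 2 does not, here is a minimal sketch of the same idea written as an explicit loop. The function name `unique_in_order` is just an illustrative name, not something from the question. It combines both of your attempts: a list remembers the order of first appearance, and a set gives fast membership checks.

```python
def unique_in_order(seq):
    """Return the unique characters of seq, in order of first appearance."""
    seen = set()      # fast membership test, but carries no order
    ordered = []      # remembers the order of first appearance
    for c in seq:
        if c not in seen:
            seen.add(c)
            ordered.append(c)
    return ordered


seq = 'YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR'
result = unique_in_order(seq)
print(result)

# On Python 3.7+, dict.fromkeys gives the same result in one call
same_as_dict = result == list(dict.fromkeys(seq))
print(same_as_dict)
```

Both versions visit each character once, so either should be fine for a few thousand rows.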
You can also use a list comprehension to build a list of results, which can then be added to the dataframe.
new_data = [list(dict.fromkeys(i)) for i in data['seq']]
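To get the result back into the dataframe, one possible sketch is to apply the same expression to the column. The column names `unique_chars` and `unique_str` are placeholders I made up; use whatever fits your data.

```python
import pandas as pd

data = {'seq': ['YSPNNIQHFHEEHLVHFVLAVLSLTTPPLLCVWNR',
                'TLGTGSFGRVMLVHYAMKILDKVLQIEHTLNEKLVKLMVMEYVPGGEMFYDKPENLLIQVTDFGFAGFDY',
                'EKIGEGTYGVVYKVAMKVSLQLIFEFLSMDLKKHKPQNLLILADFL']}
dummy_df = pd.DataFrame(data)

# One list of ordered unique characters per row
# ('unique_chars' is a hypothetical column name)
dummy_df['unique_chars'] = dummy_df['seq'].apply(lambda s: list(dict.fromkeys(s)))

# Optionally, the same characters joined back into a single string
# ('unique_str' is also a hypothetical column name)
dummy_df['unique_str'] = dummy_df['seq'].apply(lambda s: ''.join(dict.fromkeys(s)))

print(dummy_df[['unique_chars', 'unique_str']])
```

This runs row by row through apply, so it does not need a separate loop for each row and should scale to well over 1000 rows.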