I have a list of lists of strings containing the taxonomies of different bacterial species. Each list has a consistent format:
['d__domain;p__phylum;c__class;o__order;f__family;g__genus;s__species','...','...']
I'm trying to pull the genus out of each string in each list to find the unique genera. My idea was to write nested for loops that split each string by ';', use a list comprehension to find the element containing 'g__', lstrip off the g__, and append the resulting genus name to a new, complementary list. I attempted this in the code below:
finalList = []
for i in range(32586):
    outputList = []
    j = 0
    for j in taxonomyData.loc[i,'GTDB Taxonomy'][j]:
        ## Access Taxonomy column of Pandas dataframe and split by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')
        ## Use list comprehension to pull out genus by g__
        genus = [x for x in taxa if 'g__' in x]
        if genus == [] :
            genus = 'None'
        ## lstrip off g__
        else:
            genus = genus[0].lstrip('g__')
        ## Append genus to new list of genera
        outputList.append(genus)
    ## Append new list of genera to larger list
    finalList.append(outputList)
    print(finalList)
    break
print(genus)
I tested this loop and it successfully pulled the genus out of the first string of the first list, but when I let it run, it skipped to the next list, leaving the remaining strings in the first list unprocessed. How can I get the loop to iterate through every string in a list before moving on to the next list?
Solved
Final Code:
import pandas as pd

finalList = []
for i in range(32586):
    ## Access Taxonomy column of Pandas dataframe; rows with no taxonomy get ['None']
    if pd.isna(taxonomyData.loc[i,'GTDB Taxonomy']):
        genus_unique = ['None']
        finalList.append(genus_unique)
    else:
        ## Split the taxonomy string by ;
        taxa = taxonomyData.loc[i,'GTDB Taxonomy'].split(';')
        ## Use set comprehension to pull out unique genera by g__
        genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
        genus_unique = list(genus_unique)
        ## Append new list of genera to larger list
        finalList.append(genus_unique)
print(finalList)
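For data shaped as described at the top of the question (a list of lists of taxonomy strings rather than a DataFrame column), the same idea can be sketched in plain Python. This is only an illustration under that assumption: the function name extract_genera and the sample strings are made up, and an empty 'g__' entry is treated here as "no genus".

```python
def extract_genera(taxonomy_lists):
    """Return one sorted list of unique genera per inner list of taxonomy strings."""
    result = []
    for taxonomies in taxonomy_lists:       # outer loop: one inner list at a time
        genera = set()
        for taxonomy in taxonomies:         # inner loop: every string in that list
            for rank in taxonomy.split(';'):
                # Keep only genus entries that actually carry a name
                if rank.startswith('g__') and len(rank) > 3:
                    genera.add(rank[3:])
        result.append(sorted(genera))
    return result


# Made-up sample data in the format from the question
sample = [
    ['d__Bacteria;p__Firmicutes;g__Bacillus;s__Bacillus subtilis',
     'd__Bacteria;p__Firmicutes;g__Bacillus;s__Bacillus cereus',
     'd__Bacteria;p__Proteobacteria;g__Escherichia;s__Escherichia coli'],
    ['d__Bacteria;p__Firmicutes;g__;s__'],
]
genera_per_list = extract_genera(sample)
print(genera_per_list)  # [['Bacillus', 'Escherichia'], []]
```

The inner loop runs over the whole inner list, so every string is processed before the outer loop moves on to the next list.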
CodePudding user response:
Here's how you can get unique genus entries from a list with a single set comprehension:
taxa = ['d__abc', 'g__def', 'p__ghi', 'g__jkl', 'd__abc', 'g__def']
genus_unique = {x[3:] for x in taxa if x.startswith('g__')}
print(genus_unique)
Result:
{'def', 'jkl'}
You can also convert it into a list afterwards with list(genus_unique) if you need that.
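One more reason to prefer slicing over the lstrip('g__') call from the question: str.lstrip treats its argument as a set of characters, not as a prefix, so it keeps stripping any leading 'g' or '_'. The sketch below shows the difference; the helper strip_genus_prefix and the lowercase sample name are made up for illustration.

```python
def strip_genus_prefix(rank):
    """Remove a leading 'g__' by slicing; return None if the prefix is absent."""
    prefix = 'g__'
    if rank.startswith(prefix):
        return rank[len(prefix):]
    return None


name = 'g__geobacter'               # made-up lowercase name to expose the problem
with_lstrip = name.lstrip('g__')    # strips every leading 'g' and '_' character
with_slice = strip_genus_prefix(name)
print(with_lstrip)  # eobacter
print(with_slice)   # geobacter
```

Capitalized genus names would not trigger this, since 'G' is not in the stripped set, but slicing avoids relying on that. On Python 3.9 or newer, str.removeprefix('g__') does the same job as the slice.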