Context: I'm trying to turn the keys of a dictionary into columns and its values into row entries using the from_dict
method, but it doesn't seem to be working. There is more context in the comments in the code below:
for line in range(1,20):
    print("New line")
    substitutions = "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C" #for simplicity, we'll use this input for the substitutions variable
    #substitutions = final_df.iloc[line,2]
    snv = [] #Empty list that will reset for every line
    for content in substitutions.split(","):
        reference = content[0] #get the 1st character. Example, in "C241T" it retrieves "C"
        substitution = content[-1] #get the last character. Example, in "C241T" it retrieves "T"
        output = "{0}>{1}".format(reference,substitution) #put in desired column output, example "C>T"
        snv.append(output) #append the desired column to the list
    dictionary = dict() #create a dictionary to get the counts for each output in the snv list
    for key in snv:
        dictionary[key] = dictionary.get(key,0) + 1
    for key in dictionary.keys():
        if key not in pca_df.columns[:]: #if the key is not in the column of the dataframe, then add it
            pca_df.from_dict(dictionary) #Make the keys from the dictionary the columns and place the counts on the respective line
This would be the desired output:
EDIT: pca_df has this format, and I'd like to populate it with the desired output:
   seqName          clade
0  Wuhan/Hu-1/2019  19A
1  sample_1         20B
2  sample_2         20A
...
If substitutions is "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C" (sample_1), then the output in the dataframe should be:
   seqName          clade  C>T  A>G  G>A  G>C  T>C  C>A  G>T  A>T
0  Wuhan/Hu-1/2019  19A    0    0    0    0    0    0    0    0
1  sample_1         20B    4    1    2    1    0    0    0    0
Then iterate to the next line (sample_2, with substitutions as "C241T,C3037T,C14408T,A23403G,C29144T") and do the same:
   seqName          clade  C>T  A>G  G>A  G>C  T>C  C>A  G>T  A>T
0  Wuhan/Hu-1/2019  19A    0    0    0    0    0    0    0    0
1  sample_1         20B    4    1    2    1    0    0    0    0
2  sample_2         20B    4    1    0    0    0    0    0    0
etc.
Any help is very welcome! I'm fairly new to Python, so the code might not be the best.
CodePudding user response:
Adding this piece of code fixed the problem:
for key,value in dictionary.items():
    pca_df.loc[line, key] = value
I'd still like to see other (quicker/better) solutions if anyone is interested. :) Doing this for 20k lines took 57 seconds and I might need to do this to millions of lines, so this needs to be optimized for certain.
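One direction that may be quicker is to let pandas do the splitting and counting for all rows at once instead of assigning cell by cell with .loc. The following is only a sketch under assumptions: the small final_df built here is a hypothetical stand-in for the real data, and the column names "seqName" and "substitutions" are made up for illustration. I have not timed it on 20k lines.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one substitutions string per sample.
final_df = pd.DataFrame({
    "seqName": ["sample_1", "sample_2"],
    "substitutions": [
        "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C",
        "C241T,C3037T,C14408T,A23403G,C29144T",
    ],
})

# One entry per individual substitution; the original row label is repeated.
exploded = final_df["substitutions"].str.split(",").explode()

# "C241T" -> "C>T": first character, ">", last character.
snv = exploded.str[0] + ">" + exploded.str[-1]

# Count each substitution type per original row, then pivot the types into columns.
counts = snv.groupby(level=0).value_counts().unstack(fill_value=0)

# Attach the counts to the sample names; rows with no counts get 0.
result = final_df[["seqName"]].join(counts).fillna(0)
print(result)
```

The same join could be applied to pca_df, provided its index lines up with the rows the substitutions came from.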
CodePudding user response:
A general pointer:
You're iterating through the same data multiple times:
String->List
List->Dict
Dict->DataFrame
Either make your changes directly to the dataframe, or make a dictionary in one pass and then use pandas to convert it straight to a dataframe.
Pseudocode:
import pandas
data_dict = {}
for seq in dataset:
    # unclear what each row looks like, but assuming [SeqName, Clade, "Substitutions"]
    row_dict = {}
    for item in seq[2].split(","):
        # no need to create a separate list. Could make this one line, e.g. "C241T" -> "C>T":
        value = item[0] + ">" + item[-1]
        # create or increment dictionary entry
        row_dict[value] = row_dict.get(value, 0) + 1
    # Now add each row_dict to data_dict
    data_dict[seq[0]] = row_dict
# Now build the dataframe, one row per sequence. It will fill missing values with NaN
data_frame = pandas.DataFrame.from_dict(data_dict, orient="index")
You could also look at collections.defaultdict or collections.Counter (specialized dictionaries that make this kind of counting easier).
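As a rough illustration of the Counter suggestion, the per-row counting can be written as a single expression. The function name count_snvs is made up for this example, and it assumes every entry looks like "C241T" (reference base first, substituted base last).

```python
from collections import Counter

def count_snvs(substitutions):
    """Count substitution types such as 'C>T' in a comma-separated string."""
    return Counter(
        f"{item[0]}>{item[-1]}"
        for item in substitutions.split(",")
        if item  # skip empty fields, e.g. when the string itself is empty
    )

counts = count_snvs("C241T,C3037T,C14408T,A23403G,C29144T")
print(counts["C>T"])  # 4
print(counts["G>A"])  # 0, because a Counter returns 0 for missing keys
```

Since a Counter is a dict subclass, it can be stored as the per-row dictionary in place of row_dict.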