Populate dataframe columns and rows using keys and values from a dictionary, row by row

Context: I'm trying to insert the keys of a dictionary as columns and its values as rows using the from_dict method, but it doesn't seem to work. There is more context in the comments of the code below:

for line in range(1,20): 
    print("New line")
    substitutions = "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C" #for simplicity purpose, we'll use this input for substitutions variable
    #substitutions = final_df.iloc[line,2] 
    snv = [] #Empty list that will reset for every line 

    for content in substitutions.split(","):
        reference = content[0] #get the 1st character. Example, in "C241T" it retrieves "C"
        substitution = content[-1] #get the last character. Example, in "C241T" it retrieves "T"
        output = "{0}>{1}".format(reference,substitution) #put in desired column output, example "C>T"

        snv.append(output) #append the desired column to the list

    dictionary = dict() #create a dictionary to get the counts for each output in the snv list
    for key in snv:
        dictionary[key] = dictionary.get(key, 0) + 1

    for key in dictionary.keys():
        if key not in pca_df.columns[:]: #if the key is not in the column of the dataframe, then add it
            pca_df.from_dict(dictionary) #Make the keys from the dictionary the columns and place the counts on the respective line

This would be the desired output:

EDIT: the pca_df has this format and I'd like to populate it with the desired output:

    seqName            clade 
0   Wuhan/Hu-1/2019     19A 
1   sample_1            20B 
2   sample_2            20A 
...

If substitutions is "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C" (sample_1), then the dataframe should look like this:

    seqName            clade    C>T A>G G>A G>C T>C C>A G>T A>T
0   Wuhan/Hu-1/2019     19A     0   0   0   0   0   0   0   0
1   sample_1            20B     4   1   2   1   0   0   0   0

Then iterate to the next line (sample_2 with substitutions as "C241T,C3037T,C14408T,A23403G,C29144T") and do the same:

    seqName            clade    C>T A>G G>A G>C T>C C>A G>T A>T
0   Wuhan/Hu-1/2019     19A     0   0   0   0   0   0   0   0
1   sample_1            20B     4   1   2   1   0   0   0   0
2   sample_2            20A     4   1   0   0   0   0   0   0

etc.

Any help is very welcome! I'm fairly new to Python, so the code might not be the best.

CodePudding user response：

Adding this piece of code fixed the problem:

for key,value in dictionary.items():
    pca_df.loc[line, key] = value

I'd still like to see other (quicker/better) solutions if anyone is interested. :) Doing this for 20k lines took 57 seconds and I might need to do this to millions of lines, so this needs to be optimized for certain.
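For anyone wondering why the `.loc` assignment behaves differently from the `from_dict` call in the question, here is a minimal sketch. The two-row `pca_df` and the counts in `dictionary` are made-up sample data in the question's format, and the `{"sample_1": dictionary}` wrapper is only there for illustration. As far as I can tell, `from_dict` builds and returns a new DataFrame rather than changing the one it is called on, while `.loc` with a new column label adds the column in place.

```python
import pandas as pd

# Hypothetical two-row frame in the format shown in the question
pca_df = pd.DataFrame({"seqName": ["Wuhan/Hu-1/2019", "sample_1"],
                       "clade": ["19A", "20B"]})
dictionary = {"C>T": 4, "A>G": 1, "G>A": 2, "G>C": 1}  # counts for sample_1

# from_dict is a classmethod: it returns a new DataFrame,
# and pca_df keeps its original columns
new_df = pd.DataFrame.from_dict({"sample_1": dictionary}, orient="index")
columns_after_from_dict = list(pca_df.columns)

# .loc with a column label that does not exist yet adds that column in place
line = 1
for key, value in dictionary.items():
    pca_df.loc[line, key] = value
```

Rows that were never assigned (row 0 here) hold NaN in the new columns, so a final `fillna(0)` may be needed to get the zeros shown in the desired output.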

CodePudding user response：

A general pointer:

You're iterating through the same data multiple times:

String->List

List->Dict

Dict->DataFrame

Either make your changes directly to the dataframe, or make a dictionary in one pass and then use pandas to convert it straight to a dataframe.

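Sketch (Python):

    import pandas as pd

    # It's unclear what each row looks like; this assumes [seqName, clade, substitutions]
    dataset = [
        ["sample_1", "20B", "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C"],
        ["sample_2", "20A", "C241T,C3037T,C14408T,A23403G,C29144T"],
    ]

    data_dict = {}
    for seq in dataset:
        row_dict = {}
        for item in seq[2].split(","):
            # No need to create a separate list; build the key in one line.
            # This gives "CT"; use item[0] + ">" + item[-1] for "C>T".
            value = item[0] + item[-1]
            # Create or increment the dictionary entry
            row_dict[value] = row_dict.get(value, 0) + 1
        # Now add each row_dict to data_dict, keyed by sequence name
        data_dict[seq[0]] = row_dict

    # Now build the dataframe. orient="index" makes each sequence a row;
    # missing values are filled with NaN.
    data_frame = pd.DataFrame.from_dict(data_dict, orient="index")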

You could also look at collections.defaultdict or collections.Counter (specialized dictionaries that make this kind of counting easier).
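A minimal sketch of the Counter route, assuming the substitution strings sit in a column of the dataframe. The column name `substitutions`, the helper `count_snv`, and the three sample rows are assumptions for illustration, not taken from the real data. All counts are built first and attached in one step instead of being assigned cell by cell.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data; the "substitutions" column name is an assumption
pca_df = pd.DataFrame({
    "seqName": ["Wuhan/Hu-1/2019", "sample_1", "sample_2"],
    "clade": ["19A", "20B", "20A"],
    "substitutions": [
        "",
        "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C",
        "C241T,C3037T,C14408T,A23403G,C29144T",
    ],
})


def count_snv(substitutions):
    """Count substitution types such as 'C>T' in a comma-separated string."""
    return Counter(f"{s[0]}>{s[-1]}" for s in substitutions.split(",") if s)


# One Counter per row; pandas takes the union of all keys as columns
counts = pd.DataFrame([count_snv(s) for s in pca_df["substitutions"]],
                      index=pca_df.index)
# Types that never occur in a row come out as NaN, so replace them with 0
counts = counts.fillna(0).astype(int)

# Attach all count columns at once, aligned on the row index
result = pca_df.drop(columns="substitutions").join(counts)
```

Only substitution types that actually occur become columns here; types such as T>C that appear in the desired output but not in the data would have to be added separately, for example with `reindex`.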