How do you make lists the same length by adding N/A to all of them?

I have a dictionary:

gene_table_comparison = {
    "index":[1,2,3,4,5],
    "GeneID_1":["a","b","c","d","e"],
    "Start_1":[100,200,300,400,500],
    "Function_1":["Bruh","","Dude","","Seriously"],
    "GeneID_2":[1,2,3],
    "Start_2":["x","y","z"],
    "Function_2":["Geez","","Deez"]
}

and I want to convert it into a data frame using pd.DataFrame(gene_table_comparison).

That requires all of the lists to be the same length, so I would like to pad the shorter lists with N/A values at the end. How do I do that? And what if every list has a different, arbitrary length?

CodePudding user response:

This is not the most efficient code, but I think it does a good job of demonstrating the fundamentals of how to fix your issue.

First, you want to find the maximum list length:

max_length = 0
for col in gene_table_comparison.values():
    if len(col) > max_length:
        max_length = len(col)

Next, you can append np.nan to the shorter lists until they are all the same length:

import numpy as np

for col in gene_table_comparison.values():
    for _ in range(max_length - len(col)):
        col.append(np.nan)

Taken together:

import numpy as np

max_length = 0
for col in gene_table_comparison.values():
    if len(col) > max_length:
        max_length = len(col)

for col in gene_table_comparison.values():
    for _ in range(max_length - len(col)):
        col.append(np.nan)
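
The question also asks about lists that all have different lengths. The same two steps cover that case without changes. Below is a condensed sketch of them, using max() and list.extend in place of the explicit loops and math.nan in place of np.nan (both are the same float NaN value). The column names and data are made up for illustration.

```python
import math

# Hypothetical columns with three different lengths, including an empty one.
columns = {
    "a": [1, 2, 3, 4],
    "b": ["x"],
    "c": [],
}

# Step 1: find the length of the longest list.
max_length = max(len(col) for col in columns.values())

# Step 2: pad every shorter list in place; the longest list gets an empty extension.
for col in columns.values():
    col.extend([math.nan] * (max_length - len(col)))

print({key: len(col) for key, col in columns.items()})  # {'a': 4, 'b': 4, 'c': 4}
```

Note that this, like the loops above, modifies the original lists rather than creating copies.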

CodePudding user response:

Here's a one-liner that works:

gene_table_comparison = {
    "index":[1,2,3,4,5],
    "GeneID_1":["a","b","c","d","e"],
    "Start_1":[100,200,300,400,500],
    "Function_1":["Bruh","","Dude","","Seriously"],
    "GeneID_2":[1,2,3],
    "Start_2":["x","y","z"],
    "Function_2":["Geez","","Deez"]
}

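import pandas as pd

# Wrapping each list in a Series lets pandas align the columns on their index
# and fill the missing positions with NaN.
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in gene_table_comparison.items()})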
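
Why this works: when a DataFrame is built from a dict of Series, pandas aligns the Series on their index labels and fills any label missing from a column with NaN. A small sketch with two made-up columns (the names `long` and `short` are only for illustration):

```python
import pandas as pd

long = pd.Series(["a", "b", "c", "d", "e"])  # default index 0..4
short = pd.Series([1, 2, 3])                 # default index 0..2

# The DataFrame index is the union of both indexes (0..4);
# "short" has no values for labels 3 and 4, so they become NaN.
df = pd.DataFrame({"long": long, "short": short})

print(df)
print(df["short"].isna().tolist())  # [False, False, False, True, True]
```

One side effect to be aware of: an integer column that receives NaN is converted to float, since NaN is a floating-point value.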

CodePudding user response:

I'm not sure what is meant by N/A, so I'll assume None.

Start by figuring out the longest list (value) length:

gene_table_comparison = {
    "index":[1,2,3,4,5],
    "GeneID_1":["a","b","c","d","e"],
    "Start_1":[100,200,300,400,500],
    "Function_1":["Bruh","","Dude","","Seriously"],
    "GeneID_2":[1,2,3],
    "Start_2":["x","y","z"],
    "Function_2":["Geez","","Deez"]
}

max_ = max(map(len, gene_table_comparison.values()))

Then iterate over the values and pad them as needed:

for v in gene_table_comparison.values():
    if (a := max_ - len(v)) > 0:
        v.extend([None]*a)

print(gene_table_comparison)

Output:

{'index': [1, 2, 3, 4, 5], 'GeneID_1': ['a', 'b', 'c', 'd', 'e'], 'Start_1': [100, 200, 300, 400, 500], 'Function_1': ['Bruh', '', 'Dude', '', 'Seriously'], 'GeneID_2': [1, 2, 3, None, None], 'Start_2': ['x', 'y', 'z', None, None], 'Function_2': ['Geez', '', 'Deez', None, None]}
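
A related standard-library alternative, not the approach used above, is itertools.zip_longest, which walks several iterables in parallel and substitutes a fill value once the shorter ones run out. A minimal sketch on a reduced version of the dictionary:

```python
from itertools import zip_longest

gene_table_comparison = {
    "GeneID_1": ["a", "b", "c", "d", "e"],
    "GeneID_2": [1, 2, 3],
}

# Each row takes one element from every list; exhausted lists contribute None.
rows = list(zip_longest(*gene_table_comparison.values(), fillvalue=None))

# Transpose the rows back into one padded list per key.
padded = {key: list(col) for key, col in zip(gene_table_comparison, zip(*rows))}

print(padded["GeneID_2"])  # [1, 2, 3, None, None]
```

Unlike the extend loop, this leaves the original lists untouched and builds new ones.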

CodePudding user response:

Here's an option using a custom function that pads the lists with a placeholder of your choice:

gene_table_comparison = {
    "index":[1,2,3,4,5],
    "GeneID_1":["a","b","c","d","e"],
    "Start_1":[100,200,300,400,500],
    "Function_1": ["Bruh","","Dude","","Seriously"],
    "GeneID_2":[1,2,3],
    "Start_2":["x","y","z"],
    "Function_2":["Geez","","Deez"]
}

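def fill_with_placeholder(d, placeholder):
    # Length of the longest list in the dictionary
    max_length = max(map(len, d.values()))
    # Build a new list padded with the placeholder up to max_length
    fill_list = lambda l: l + [placeholder for _ in range(max_length - len(l))]
    return {
        key: fill_list(sublist) if len(sublist) < max_length else sublist for key, sublist in d.items()
    }

print(fill_with_placeholder(gene_table_comparison, "NA"))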

Result:

{
'index': [1, 2, 3, 4, 5], 'GeneID_1': ['a', 'b', 'c', 'd', 'e'],
'Start_1': [100, 200, 300, 400, 500], 'Function_1': ['Bruh', '',
'Dude', '', 'Seriously'], 'GeneID_2': [1, 2, 3, 'NA', 'NA'],
'Start_2': ['x', 'y', 'z', 'NA', 'NA'], 'Function_2': ['Geez', '',
'Deez', 'NA', 'NA']
}
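
One caveat when the padded dictionary is headed for a DataFrame: pandas treats the string 'NA' as ordinary text, not as a missing value, so methods such as isna() will not flag it. Passing None or float("nan") as the placeholder avoids that. A short sketch of the difference, using two made-up Series:

```python
import pandas as pd

padded_with_string = pd.Series([1, 2, 3, "NA", "NA"])
padded_with_none = pd.Series([1, 2, 3, None, None])

# The string placeholder is just data; None is converted to NaN and counted as missing.
print(padded_with_string.isna().sum())  # 0
print(padded_with_none.isna().sum())    # 2
```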