I have a dictionary:
gene_table_comparison = {
"index":[1,2,3,4,5],
"GeneID_1":["a","b","c","d","e"],
"Start_1":[100,200,300,400,500],
"Function_1":["Bruh","","Dude","","Seriously"],
"GeneID_2":[1,2,3],
"Start_2":["x","y","z"],
"Function_2":["Geez","","Deez"]
}
and I want to convert it into a data frame using pd.DataFrame(gene_table_comparison).
That requires all of the lists to be the same length, though, and I would like the N/A values to go at the end of each shorter list. How do I do that? And what if each list has a different, arbitrary length?
CodePudding user response:
This is not the most efficient code, but I think it does a good job of demonstrating the fundamentals of how to fix your issue.
First, you want to find the maximum list length:
max_length = 0
for col in gene_table_comparison.values():
    if len(col) > max_length:
        max_length = len(col)
Next, append NaN to each list until they are all the same length:
import numpy as np
for col in gene_table_comparison.values():
    for _ in range(max_length - len(col)):
        col.append(np.nan)
Taken together:
import numpy as np
max_length = 0
for col in gene_table_comparison.values():
    if len(col) > max_length:
        max_length = len(col)
for col in gene_table_comparison.values():
    for _ in range(max_length - len(col)):
        col.append(np.nan)
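Note that the loops above modify the original lists in place. If you would rather leave the input untouched, one alternative (a different technique from the loops above, shown here only as a sketch) is to let itertools.zip_longest do the padding and then transpose back. The shortened dictionary and the name padded are just for illustration, and math.nan stands in for np.nan so the example needs only the standard library.

```python
from itertools import zip_longest
import math

gene_table_comparison = {
    "index": [1, 2, 3, 4, 5],
    "GeneID_2": [1, 2, 3],
    "Start_2": ["x", "y", "z"],
}

# zip_longest yields one tuple per row, padding the short columns with NaN.
rows = zip_longest(*gene_table_comparison.values(), fillvalue=math.nan)

# Zipping the rows again transposes them back into columns,
# which are then paired with the original keys in order.
padded = {
    key: list(col)
    for key, col in zip(gene_table_comparison, zip(*rows))
}

print(padded["GeneID_2"])
print(gene_table_comparison["GeneID_2"])  # the original list is unchanged
```

The padded dictionary can then be passed to pd.DataFrame in the same way as the in-place version.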
CodePudding user response:
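Here's a one-liner that works. Wrapping each list in a pd.Series lets pandas align the columns by index and fill the missing positions with NaN: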
gene_table_comparison = {
"index":[1,2,3,4,5],
"GeneID_1":["a","b","c","d","e"],
"Start_1":[100,200,300,400,500],
"Function_1":["Bruh","","Dude","","Seriously"],
"GeneID_2":[1,2,3],
"Start_2":["x","y","z"],
"Function_2":["Geez","","Deez"]
}
import pandas as pd
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in gene_table_comparison.items()})
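To see what the one-liner produces, here is a small check on a shortened version of the dictionary (the shortening and the name missing are only for illustration). It counts the missing cells per column and looks at the dtype of the padded integer column.

```python
import pandas as pd

gene_table_comparison = {
    "index": [1, 2, 3, 4, 5],
    "GeneID_2": [1, 2, 3],
    "Start_2": ["x", "y", "z"],
}

# Each list becomes a Series indexed from 0; the DataFrame constructor
# aligns the Series on the union of their indexes and fills gaps with NaN.
dict_df = pd.DataFrame(
    {key: pd.Series(value) for key, value in gene_table_comparison.items()}
)

# Count the missing cells in each column.
missing = dict_df.isna().sum().to_dict()
print(missing)

# NaN is a float, so the padded integer column is upcast.
print(dict_df["GeneID_2"].dtype)
```

One side effect worth knowing about: because NaN is a float, a column of integers that gets padded ends up as float64. If that matters, pandas' nullable integer dtype (for example .astype("Int64")) is one way to keep whole numbers alongside missing values.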
CodePudding user response:
It's not clear what N/A should be here, so I'll assume None.
Start by figuring out the longest list (value) length:
gene_table_comparison = {
"index":[1,2,3,4,5],
"GeneID_1":["a","b","c","d","e"],
"Start_1":[100,200,300,400,500],
"Function_1":["Bruh","","Dude","","Seriously"],
"GeneID_2":[1,2,3],
"Start_2":["x","y","z"],
"Function_2":["Geez","","Deez"]
}
max_ = max(map(len, gene_table_comparison.values()))
Then iterate over the values and pad them as appropriate (the := operator needs Python 3.8 or later):
for v in gene_table_comparison.values():
    if (a := max_ - len(v)) > 0:
        v.extend([None]*a)
print(gene_table_comparison)
Output:
{'index': [1, 2, 3, 4, 5], 'GeneID_1': ['a', 'b', 'c', 'd', 'e'], 'Start_1': [100, 200, 300, 400, 500], 'Function_1': ['Bruh', '', 'Dude', '', 'Seriously'], 'GeneID_2': [1, 2, 3, None, None], 'Start_2': ['x', 'y', 'z', None, None], 'Function_2': ['Geez', '', 'Deez', None, None]}
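Once the lists are padded, the dictionary can go straight into pd.DataFrame. The sketch below uses a shortened dictionary for illustration and relies on the fact that pandas treats None as a missing value, so isna() picks up the padded cells. The name padded_cells is made up for this example.

```python
import pandas as pd

gene_table_comparison = {
    "index": [1, 2, 3, 4, 5],
    "GeneID_2": [1, 2, 3],
    "Start_2": ["x", "y", "z"],
}

max_ = max(map(len, gene_table_comparison.values()))
for v in gene_table_comparison.values():
    # [None] * 0 is an empty list, so full-length lists are left as they are.
    v.extend([None] * (max_ - len(v)))

# All lists now have the same length, so the constructor accepts them.
df = pd.DataFrame(gene_table_comparison)

# isna() is True for the None padding (and for NaN).
padded_cells = int(df.isna().sum().sum())
print(padded_cells)
```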
CodePudding user response:
Here's an option, using a custom function that pads the shorter lists with a placeholder of your choice:
gene_table_comparison = {
"index":[1,2,3,4,5],
"GeneID_1":["a","b","c","d","e"],
"Start_1":[100,200,300,400,500],
"Function_1": ["Bruh","","Dude","","Seriously"],
"GeneID_2":[1,2,3],
"Start_2":["x","y","z"],
"Function_2":["Geez","","Deez"]
}
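def fill_with_placeholder(d, placeholder):
    max_length = max(map(len, d.values()))
    # Return a new list: the original plus enough placeholders to reach max_length.
    fill_list = lambda l: l + [placeholder for _ in range(max_length - len(l))]
    return {
        key: fill_list(sublist) if len(sublist) < max_length else sublist for key, sublist in d.items()
    }

print(fill_with_placeholder(gene_table_comparison, "NA"))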
Result:
{
'index': [1, 2, 3, 4, 5],
'GeneID_1': ['a', 'b', 'c', 'd', 'e'],
'Start_1': [100, 200, 300, 400, 500],
'Function_1': ['Bruh', '', 'Dude', '', 'Seriously'],
'GeneID_2': [1, 2, 3, 'NA', 'NA'],
'Start_2': ['x', 'y', 'z', 'NA', 'NA'],
'Function_2': ['Geez', '', 'Deez', 'NA', 'NA']
}
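One detail of this approach that is easy to miss: it builds a new dictionary rather than changing the input, but lists that are already full length are reused as-is rather than copied. The sketch below uses a cleaned-up copy of the function on a shortened dictionary (both only for illustration) to show the difference.

```python
def fill_with_placeholder(d, placeholder):
    """Return a new dict whose lists are all padded to the longest length."""
    max_length = max(map(len, d.values()))
    return {
        # Short lists get a new, padded list; full-length lists are reused.
        key: sublist + [placeholder] * (max_length - len(sublist))
        if len(sublist) < max_length
        else sublist
        for key, sublist in d.items()
    }


gene_table_comparison = {
    "index": [1, 2, 3, 4, 5],
    "GeneID_2": [1, 2, 3],
}

result = fill_with_placeholder(gene_table_comparison, "NA")

print(result["GeneID_2"])                 # padded copy
print(gene_table_comparison["GeneID_2"])  # original left alone
print(result["index"] is gene_table_comparison["index"])  # same list object
```

So appending to result["index"] later would also change the original dictionary. If that is a concern, copying with list(sublist) in the else branch avoids the sharing.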