I'm working with a dataset that has gene names and gene IDs. IDs are unique, but one name can correspond to multiple IDs.
I store all IDs for a gene name in a list, so the dataframe looks like:
|GeneName|GeneID|
|Name_1|[ID_1, ID_2, ID_5]|
|Name_2|[ID_3, ID_4]|
All names and IDs are strings, but some IDs are missing, and I use NaN to represent the missing ones (I'm not sure whether this is good practice).
After saving the dataframe to a CSV file and loading it back, all the lists of gene IDs are read as strings. I found a solution using:
pd.read_csv(fpath, converters={'GeneName': pd.eval, 'GeneID': pd.eval})
to load them as lists, but I get:
pandas.core.computation.ops.UndefinedVariableError: name 'NaN' is not defined
What is the best way to deal with a situation like this? Thanks.
CodePudding user response:
From the problem you described in the comments, you can simply use empty strings to indicate missing values.
Then parse the stringified lists with pd.eval or ast.literal_eval:
import ast
ast.literal_eval('["ID_1", "ID_2", "", "", "ID_5"]')
# ['ID_1', 'ID_2', '', '', 'ID_5']
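To see why the bare NaN token is the problem: a literal parser reads NaN as a variable name rather than a value, so it rejects the whole list. The following is only a minimal sketch using the standard library; parse_ids is a hypothetical helper name, not something from the question.

```python
import ast


def parse_ids(text):
    """Parse a stringified list of IDs (hypothetical helper).

    Returns None when the text is not a valid Python literal.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


# Empty strings are valid literals, so this list parses.
ok = parse_ids('["ID_1", "", "ID_5"]')

# A bare NaN is a name, not a literal, so parsing fails.
bad = parse_ids('["ID_1", NaN, "ID_5"]')
```

Here ok is a real list, while bad is None because ast.literal_eval refuses the NaN name.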
Important note:
Use different quote characters (' and ") for the outer string and for the list elements inside it, so that the inner quotes do not end the outer string early.
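Putting this together, a full round trip might look like the sketch below. It assumes the column names from the question, uses made-up sample data, and writes to an in-memory buffer instead of a real file. Only GeneID gets a converter, since GeneName is already a plain string.

```python
import ast
import io

import pandas as pd

# Hypothetical sample data; empty strings stand in for missing IDs.
df = pd.DataFrame({
    "GeneName": ["Name_1", "Name_2"],
    "GeneID": [["ID_1", "ID_2", "", "ID_5"], ["ID_3", "ID_4"]],
})

# Write to an in-memory buffer so the sketch needs no file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Each GeneID cell is stored as the text form of a list;
# ast.literal_eval turns that text back into a list on load.
loaded = pd.read_csv(buffer, converters={"GeneID": ast.literal_eval})
```

After loading, each GeneID cell is a list again, and the empty strings can be filtered out or treated as missing wherever that is needed.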