Pandas DataFrame - What is the correct way to operate with multiple values in one cell?

I'm working with a dataset which has gene names and gene ids. Basically, ids are uniquely defined, while one name can correspond to multiple ids.

I use a list to contain all ids of a gene name and the dataframe looks like:

|GeneName|GeneID|
|---|---|
|Name_1|[ID_1, ID_2, ID_5]|
|Name_2|[ID_3, ID_4]|

All names and ids are strings, but some ids are missing and I use NaN to represent the missing ones (not sure if this is good practice either).

After saving the dataframe to a csv file and loading it back, all the lists containing gene ids are read as strings. I found a solution using:

pd.read_csv(fpath, converters={'GeneName': pd.eval, 'GeneID': pd.eval})

to load them as lists, but I get this error:

pandas.core.computation.ops.UndefinedVariableError: name 'NaN' is not defined

What is the best way to deal with a situation like this? Thanks.

CodePudding user response：

From the problem you described in the comments, you can just use empty strings to mark the missing ids instead of NaN. An empty string is a valid literal, so the list string can then be parsed with pd.eval or ast.literal_eval:

import ast

# Parse the string form of a list back into a real Python list
ast.literal_eval('["ID_1", "ID_2", "", "", "ID_5"]')
# ['ID_1', 'ID_2', '', '', 'ID_5']

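Important Note:
Use different quote characters for the list string and for the strings inside it, e.g. ' around the whole list string and " around each element, so that the inner quotes do not end the outer string early.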
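
For the CSV round trip from the question, the same idea could look roughly like the sketch below. It assumes the missing ids are stored as empty strings, uses an in-memory buffer in place of a real file path, and applies the converter only to GeneID, since GeneName is already a plain string. The sample values are made up for illustration.

```python
import ast
import io

import pandas as pd

# Missing ids are stored as empty strings instead of NaN,
# so every cell stays a valid Python list literal.
df = pd.DataFrame({
    "GeneName": ["Name_1", "Name_2"],
    "GeneID": [["ID_1", "ID_2", "", "ID_5"], ["ID_3", "ID_4"]],
})

# Write to an in-memory buffer; a real file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse only the list column back into Python lists.
restored = pd.read_csv(buffer, converters={"GeneID": ast.literal_eval})

print(restored["GeneID"].tolist())
```

ast.literal_eval only accepts literals, so it will not run arbitrary code found in the file, but for the same reason it rejects a bare NaN inside the list, which is why the empty-string placeholder is used here.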