Pandas DataFrame - What is the correct way to operate with multiple values in one cell?

Time:12-28

I'm working with a dataset which has gene names and gene ids. Basically, ids are uniquely defined, while one name can correspond to multiple ids.

I store all ids for a gene name in a list, so the dataframe looks like:

|GeneName|GeneID|
|Name_1|[ID_1, ID_2, ID_5]|
|Name_2|[ID_3, ID_4]|

All names and ids are strings, but some ids are missing and I use NaN to represent the missing ones (not sure if this is good practice either).

After saving the dataframe to a csv file and loading it back, all the lists of gene ids are read as strings. I found a solution using:

pd.read_csv(fpath, converters={'GeneName': pd.eval, 'GeneID': pd.eval})

to load them as lists, but I get:

pandas.core.computation.ops.UndefinedVariableError: name 'NaN' is not defined

What is the best way to deal with a situation like this? Thanks.

CodePudding user response:

From the problem you described in the comments, you can just use empty strings instead of NaN to indicate missing ids.
Then parse each cell with pd.eval or ast.literal_eval:

import ast
ast.literal_eval('["ID_1", "ID_2", "", "", "ID_5"]')

>>['ID_1', 'ID_2', '', '', 'ID_5']
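
As a rough sketch of how this could fit into the save/load round trip from the question: the data below is made up to resemble the question's table, blank_missing is a hypothetical helper (not part of pandas), and an in-memory buffer stands in for the csv file. It also assumes GeneName is a plain string, so only GeneID gets a converter.

```python
import ast
import io
import math

import pandas as pd


def blank_missing(ids):
    """Hypothetical helper: replace float NaN entries with empty strings."""
    return ["" if isinstance(x, float) and math.isnan(x) else x for x in ids]


# Made-up data modelled on the question; NaN marks a missing id.
df = pd.DataFrame(
    {
        "GeneName": ["Name_1", "Name_2"],
        "GeneID": [["ID_1", "ID_2", float("nan"), "ID_5"], ["ID_3", "ID_4"]],
    }
)

# Swap NaN for "" before saving, so every cell is a list of plain strings.
df["GeneID"] = df["GeneID"].apply(blank_missing)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the GeneID column back into Python lists while reading.
restored = pd.read_csv(buffer, converters={"GeneID": ast.literal_eval})

print(restored["GeneID"].tolist())
```

ast.literal_eval only accepts Python literals, which is why the NaN entries have to be replaced before the lists are written out.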

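Important note:
Use one kind of quote (') for the string that holds the list and the other kind (") for the strings inside the list, so the inner quotes do not end the outer string.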
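
A small illustration of the quoting point, with made-up ids: either combination parses, as long as the outer and inner quotes differ. str() on a list already produces single-quoted elements, so text generated that way should be safe to wrap in double quotes.

```python
import ast

# Outer single quotes, inner double quotes.
a = ast.literal_eval('["ID_1", "ID_2"]')

# Outer double quotes, inner single quotes.
b = ast.literal_eval("['ID_1', 'ID_2']")

# Both spellings give the same list.
print(a == b)

# str() on a list uses single quotes around string elements.
text = str(["ID_1", "ID_2"])
print(text)
```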
