split arrays as string in one column of pandas to multiple columns

Time:07-10

I have a data frame in the format below. The description column is stored as a string.

file description
x [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]
y [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]

How can I convert the data frame into the format below?

file license score
x ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] [0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]
y ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'] [0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457]

The above is just an example. The actual data frame is very large.

CodePudding user response:

Update: if the elements of the column are strings, you can extract the arrays with a regular expression. (Note: don't use eval; see "Why should exec() and eval() be avoided?")

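import ast
import pandas as pd

# Each row is a list of two strings: the license list and the score list.
new_cols = lambda x: pd.Series({'licence': ast.literal_eval(x[0]),
                                'score': ast.literal_eval(x[1])})

# Extract the innermost [...] groups, parse them, and replace the original column.
df = df.join(df['description'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('description', axis=1)
print(df)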

Output:

    file    licence                                              score
0   x       ['MIT', 'MIT', 'MIT', 'MIT', 'MIT']                     [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1   y       ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ...    [0.28552457, 0.28552457, 0.28552457, 0.2855245...
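As a small illustration of the note about eval, the sketch below shows how ast.literal_eval behaves on three kinds of input. The helper name is_plain_literal is hypothetical and only serves the example. It also shows why the bracketed lists have to be extracted first: the raw array(...) text is a function call, not a literal.

```python
import ast

# A list literal is parsed into a real Python list.
scores = ast.literal_eval("[0.71641791, 0.69565217]")


def is_plain_literal(text):
    """Return True if ast.literal_eval accepts the text, False otherwise."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


# A function call is not a literal, so it is rejected instead of executed.
call_accepted = is_plain_literal("__import__('os').getcwd()")

# The raw numpy repr is rejected for the same reason: array(...) is a call.
repr_accepted = is_plain_literal("array([0.71641791, 0.69565217])")
```

With eval, the first rejected string would have been executed as code, which is the risk the note warns about.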

How the regex finds the arrays: it matches every substring that starts with [ and ends with ] and contains no [ or ] in between, so only the innermost lists are matched.

>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]")
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
 '[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
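The two steps (find the innermost lists, then parse each one) can also be wrapped in a plain function, which may be easier to test on a single string before applying it to a large frame. This is a minimal sketch using only the standard library; the names INNER_LIST and parse_description are hypothetical, and it assumes every description contains exactly two bracketed lists.

```python
import ast
import re

# Matches an innermost bracket pair: "[" ... "]" with no brackets in between.
# Compiling once avoids re-parsing the pattern for every row.
INNER_LIST = re.compile(r"\[[^\[\]]*\]")


def parse_description(text):
    """Return (licenses, scores) parsed from one description string."""
    parts = INNER_LIST.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected 2 bracketed lists, found {len(parts)}")
    licenses, scores = (ast.literal_eval(p) for p in parts)
    return licenses, scores


sample = ("[[array(['MIT', 'MIT'], dtype=object), "
          "array([0.71641791, 0.69565217])]]")
licenses, scores = parse_description(sample)
```

Raising on an unexpected number of matches is a design choice for this sketch: in a very large frame it surfaces malformed rows early instead of silently producing misaligned columns.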

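Old answer (applies when the column holds actual nested arrays rather than strings): you can build the new columns and then join them to the original data frame.

new_cols = lambda x: pd.Series({'licence': x[0][0], 'score': x[0][1]})
df = df.join(df['description'].apply(new_cols)).drop('description', axis=1)
print(df)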

CodePudding user response:

Input:

  file                                        description
0    x  [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1    y  [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...

Doing:

import ast

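# Strip the numpy-specific text so only Python literals remain, then parse.
df.description = (df.description.str.replace('array', '', regex=False)
                    .str.replace(', dtype=object', '', regex=False)
                    .apply(ast.literal_eval))
# .str[0] removes the extra outer list; each x is then [licenses, scores].
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)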

Output:

  file                                            license                                              score
0    x                          [MIT, MIT, MIT, MIT, MIT]  [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1    y  [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-...  [0.28552457, 0.28552457, 0.28552457, 0.2855245...
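To see what the two replacements do to a single value, here is a small standard-library sketch on one shortened description string. The variable names are illustrative only. It shows the intermediate cleaned string and why one level of nesting has to be removed afterwards.

```python
import ast

raw = ("[[array(['APSL-1.0', 'APSL-1.0'], dtype=object), "
       "array([0.28552457, 0.28552457])]]")

# Remove the numpy-specific pieces so that only Python literals remain.
cleaned = raw.replace("array", "").replace(", dtype=object", "")

# The parentheses left behind by array(...) are harmless grouping parentheses.
nested = ast.literal_eval(cleaned)

# There is one extra outer list around the pair, so take element 0 first
# (this plays the same role as .str[0] in the answer).
licenses, scores = nested[0]
```

One caveat with this approach: the replacement is a plain substring match, so a license name that itself contains the text "array" would be altered too.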