split arrays as string in one column of pandas to multiple columns-CodePudding

I have data frame in the below format. The description is in string format.

file	description
x	[[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]
y	[[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]

How can i convert data Frame into below format.

file	license	score
x	['MIT', 'MIT', 'MIT', 'MIT', 'MIT']	[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]
y	['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0']	[0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457]

Above is just an example. Data frame is very large.

CodePudding user response：

Update, If elements in the column as string format, you can find array with regex formula. (Note don't use eval, Why should exec() and eval() be avoided?)

import ast
new_cols = lambda x: pd.Series({'licence':ast.literal_eval(x[0]), 
                                'score':ast.literal_eval(x[1])})

df = df.join(df['m'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('m', axis=1)
print(df)

Output:

    file    licence                                              score
0   x       ['MIT', 'MIT', 'MIT', 'MIT', 'MIT']                     [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1   y       ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ...    [0.28552457, 0.28552457, 0.28552457, 0.2855245...

How regex formula find arrays: (find string start with [ and end with ] but in finding string should not have [ or ] to find all arrays.)

>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",)
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
 '[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']

Old, You can create new column then join with old dataframe.

new_cols = lambda x: pd.Series({'licence':x[0][0], 'score':x[0][1]})
df = df.join(df['m'].apply(new_cols)).drop('m', axis=1)
print(df)

CodePudding user response：

Input:

  file                                        description
0    x  [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1    y  [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...

Doing:

import ast

df.description = (df.description.str.replace('array', '')
                    .str.replace(', dtype=object', '')
                    .apply(ast.literal_eval))
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)

Output:

  file                                            license                                              score
0    x                          [MIT, MIT, MIT, MIT, MIT]  [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1    y  [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-...  [0.28552457, 0.28552457, 0.28552457, 0.2855245...