I have a DataFrame in the format below. The description column holds strings.
file | description |
---|---|
x | [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]] |
y | [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]] |
How can I convert the DataFrame into the format below?
file | license | score |
---|---|---|
x | ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] | [0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217] |
y | ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'] | [0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457] |
The above is just an example; the real DataFrame is very large.
CodePudding user response:
Update: if the elements in the column are strings, you can extract the arrays with a regex. (Note: don't use eval; see "Why should exec() and eval() be avoided?")
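import ast
import pandas as pd

# Turn the two matched list strings into real Python lists.
new_cols = lambda x: pd.Series({'licence': ast.literal_eval(x[0]),
                                'score': ast.literal_eval(x[1])})
# findall returns the innermost [...] substrings of each description.
df = df.join(df['description'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('description', axis=1)
print(df)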
Output:
file licence score
0 x ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1 y ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ... [0.28552457, 0.28552457, 0.28552457, 0.2855245...
How the regex finds the arrays: it matches every substring that starts with [ and ends with ] and has no [ or ] in between, so it picks out each innermost list.
>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",)
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
'[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
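As a rough sketch of the same idea on a single string, using only the standard library: the helper name split_description and the shortened sample string are made up for illustration, and the code assumes each description contains exactly two innermost lists (licenses first, scores second).

```python
import ast
import re

# Matches a bracketed run with no nested brackets inside, i.e. an innermost list.
INNER_LIST = re.compile(r"\[[^\[\]]*\]")


def split_description(text):
    """Return (licenses, scores) parsed from one description string.

    Hypothetical helper: assumes exactly two innermost lists, in that order.
    """
    parts = INNER_LIST.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected 2 bracketed lists, found {len(parts)}")
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    licenses, scores = (ast.literal_eval(p) for p in parts)
    return licenses, scores


sample = (
    "[[array(['MIT', 'MIT'], dtype=object), "
    "array([0.71641791, 0.69565217])]]"
)
licenses, scores = split_description(sample)
print(licenses)  # ['MIT', 'MIT']
print(scores)    # [0.71641791, 0.69565217]
```

One caveat with this approach: a license name that itself contains [ or ] would split the match, so it is worth checking the data for that before relying on it.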
Old: if the column holds the nested objects themselves rather than strings, you can create the new columns and join them to the original DataFrame.
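import pandas as pd

# x is the nested [[licences, scores]] object stored in each cell.
new_cols = lambda x: pd.Series({'licence': x[0][0], 'score': x[0][1]})
df = df.join(df['description'].apply(new_cols)).drop('description', axis=1)
print(df)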
CodePudding user response:
Input:
file description
0 x [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1 y [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...
Doing:
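import ast

# Strip the NumPy repr wrappers, then parse what remains as Python literals.
df.description = (df.description.str.replace('array', '', regex=False)
                    .str.replace(', dtype=object', '', regex=False)
                    .apply(ast.literal_eval))
# Each parsed value is [[licenses, scores]]; .str[0] unwraps the outer list.
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)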
Output:
file license score
0 x [MIT, MIT, MIT, MIT, MIT] [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1 y [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-... [0.28552457, 0.28552457, 0.28552457, 0.2855245...
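To see what the string cleanup does to one value, here is a small standard-library sketch. The shortened sample string is invented for illustration; the replacements mirror the ones used on the column above.

```python
import ast

raw = "[[array(['APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457])]]"

# Remove the NumPy repr wrappers so only Python literals remain.
# Caveat: this would also mangle a license name that contains "array".
cleaned = raw.replace("array", "").replace(", dtype=object", "")
print(cleaned)
# [[(['APSL-1.0', 'APSL-1.0']), ([0.28552457, 0.28552457])]]

# The leftover parentheses are plain grouping, so literal_eval accepts them.
nested = ast.literal_eval(cleaned)

# nested is [[licenses, scores]]; index 0 unwraps the outer list.
licenses, scores = nested[0]
print(licenses)  # ['APSL-1.0', 'APSL-1.0']
print(scores)    # [0.28552457, 0.28552457]
```

The extra outer list is why the answer indexes with .str[0] before splitting each row into the two new columns.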