I have a dataframe with the following columns:
          col1            col2                col3                            col4
0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001, C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe                        C2674738
2  EFO:0009136  ['HP:0001507']  Growth abnormality                        C0262361
I would like to explode "col4". I tried several approaches, but none of them work. The dtype of the column is "object".
My attempts so far:
df.explode('col4')

df['col4'] = df['col4'].str.split(',')
df = df.set_index(['col2']).apply(pd.Series.explode).reset_index()

import ast
df[['col4']] = df[['col4']].applymap(ast.literal_eval)
df = df.apply(pd.Series.explode)
The expected output is:
          col1            col2                col3                  col4
0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361
CodePudding user response:
IIUC, try:
out = df.assign(col5=df['col5'].str.split(', ')).explode('col5')
print(out)
# Output
   col1         col2            col3                col4                  col5
0     0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0     0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1     1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2     2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361
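As a side note on why the plain df.explode('col4') from the question appears to do nothing: explode only expands list-like cells, and a comma-separated string counts as a single scalar. A minimal sketch, assuming a cut-down frame rebuilt by hand with just two of the question's columns:

```python
import pandas as pd

# Hand-built sample for illustration; only two of the question's columns are kept.
df = pd.DataFrame({
    "col1": ["HP:0005709", "HP:0005709", "EFO:0009136"],
    "col4": ["SNOMEDCT_US:32113001, C0265660", "C2674738", "C0262361"],
})

# Each cell is a string, so explode() has nothing to expand.
unchanged = df.explode("col4")

# Splitting first turns each cell into a list, which explode() can expand.
out = df.assign(col4=df["col4"].str.split(", ")).explode("col4")
print(out)
```

Note that explode returns a new frame in both cases, so the result has to be assigned to a name to be kept.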
CodePudding user response:
Your input data is confusing because it has 5 headers for only 4 columns (or is the index a "normal" column?). To explode col4, first split each string into a list, then explode:
df['col4'] = df['col4'].str.split(r',\s*', regex=True)
df = df.explode('col4')
Output:
          col1            col2                col3                  col4
0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361
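The exploded rows keep the index label of the row they came from, which is why 0 appears twice in the output. If a fresh 0..n-1 index is preferred, explode accepts ignore_index=True (available from pandas 1.1). A small sketch, assuming a hand-built two-row sample with irregular spacing after the comma to show what the regex split absorbs:

```python
import pandas as pd

# Hand-built sample; the double space after the comma is deliberate.
df = pd.DataFrame({
    "col3": ["Toe syndactyly", "Abnormality of toe"],
    "col4": ["SNOMEDCT_US:32113001,  C0265660", "C2674738"],
})

# Split on a comma followed by any amount of whitespace.
df["col4"] = df["col4"].str.split(r",\s*", regex=True)

# ignore_index=True renumbers the rows instead of repeating the source labels.
flat = df.explode("col4", ignore_index=True)
print(flat)
```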
CodePudding user response:
import pandas as pd

# Rebuild the sample data, treating the leading 0/1/2 as a regular column
columns = ['col1', 'col2', 'col3', 'col4', 'col5']
data = [['0', 'HP:0005709', "['HP:0001770']", 'Toe syndactyly', 'SNOMEDCT_US:32113001, C0265660'],
        ['1', 'HP:0005709', "['HP:0001780']", 'Abnormality of toe', 'C2674738'],
        ['2', 'EFO:0009136', "['HP:0001507']", 'Growth abnormality', 'C0262361']]
df = pd.DataFrame(data, columns=columns)
print(df)
  col1         col2            col3                col4                            col5
0    0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001, C0265660
1    1   HP:0005709  ['HP:0001780']  Abnormality of toe                        C2674738
2    2  EFO:0009136  ['HP:0001507']  Growth abnormality                        C0262361
column = 'col5'
# Split on a comma plus any following whitespace so no value keeps a leading space
df[column] = df[column].str.split(r',\s*', regex=True)
new_df = df.explode(column)
print(new_df)
  col1         col2            col3                col4                  col5
0    0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0    0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1    1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2    2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361
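Regarding the ast.literal_eval attempt in the question: that approach suits a column whose strings look like Python lists (such as "['HP:0001770']"), but not plain comma-separated text, which is not a Python literal. A hedged sketch, assuming a hand-built two-row sample and the question's column names, that parses one column and splits the other:

```python
import ast
import pandas as pd

# Hand-built sample using the question's column names.
df = pd.DataFrame({
    "col2": ["['HP:0001770']", "['HP:0001780']"],
    "col4": ["SNOMEDCT_US:32113001, C0265660", "C2674738"],
})

# col2 holds the text form of a Python list, so literal_eval can parse it.
df["col2"] = df["col2"].map(ast.literal_eval)

# col4 is plain comma-separated text, so it is split rather than parsed.
df["col4"] = df["col4"].str.split(r",\s*", regex=True)

# Explode one column at a time; each call repeats the other columns' values.
out = df.explode("col2").explode("col4")
print(out)
```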