Explode is not working on pandas dataframe

Time:01-19

I have a dataframe with the following columns:

    col1        col2            col3                  col4            
0   HP:0005709  ['HP:0001770']  Toe syndactyly        SNOMEDCT_US:32113001, C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe    C2674738
2   EFO:0009136 ['HP:0001507']  Growth abnormality    C0262361

I would like to explode "col4". I tried several approaches, but none of them worked. The dtype of the column is "object".

My attempts so far:

  1. df.explode('col4')

  2. df['col4'] = df['col4'].str.split(',')
     df = df.set_index(['col2']).apply(pd.Series.explode).reset_index()

  3. import ast
     df[['col4']] = df[['col4']].applymap(ast.literal_eval)
     df = df.apply(pd.Series.explode)

The expected output is:

    col1        col2            col3                col4                
0   HP:0005709  ['HP:0001770']  Toe syndactyly      SNOMEDCT_US:32113001
0   HP:0005709  ['HP:0001770']  Toe syndactyly      C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe  C2674738
2   EFO:0009136 ['HP:0001507']  Growth abnormality  C0262361

CodePudding user response:

IIUC, try:

out = df.assign(**{'col5': df['col5'].str.split(', ')}).explode('col5')
print(out)

# Output
   col1         col2            col3                col4                  col5
0     0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0     0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1     1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2     2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361

CodePudding user response:

Your input data is confusing because it has 5 headers for only 4 columns (or is the index a "normal" column?). To explode col4, first split each string into a list, then explode:

df['col4'] = df['col4'].str.split(r',\s*', regex=True)
df = df.explode('col4')

Output:

          col1            col2                col3                  col4
0   HP:0005709  ['HP:0001770']      Toe syndactyly  SNOMEDCT_US:32113001
0   HP:0005709  ['HP:0001770']      Toe syndactyly              C0265660
1   HP:0005709  ['HP:0001780']  Abnormality of toe              C2674738
2  EFO:0009136  ['HP:0001507']  Growth abnormality              C0262361
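
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the frame has the four columns shown in the question, with the leading numbers being the index. It also shows why a bare `explode` appears to do nothing: the cells are plain strings, and `explode` only expands list-like values.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumption: 0, 1, 2 are the index).
df = pd.DataFrame({
    "col1": ["HP:0005709", "HP:0005709", "EFO:0009136"],
    "col2": ["['HP:0001770']", "['HP:0001780']", "['HP:0001507']"],
    "col3": ["Toe syndactyly", "Abnormality of toe", "Growth abnormality"],
    "col4": ["SNOMEDCT_US:32113001, C0265660", "C2674738", "C0262361"],
})

# explode() only expands list-like cells; plain strings pass through unchanged,
# so this still has three rows.
unchanged = df.explode("col4")

# Split each string into a list first, then explode.
# A multi-character pattern is treated as a regular expression by default.
out = df.assign(col4=df["col4"].str.split(r",\s*")).explode("col4")
print(out)
```

The exploded rows keep the original index label (0 appears twice), which matches the expected output in the question.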

CodePudding user response:

import pandas as pd

columns = ['col1', 'col2', 'col3', 'col4', 'col5']
data = [['0', 'HP:0005709', "['HP:0001770']", 'Toe syndactyly','SNOMEDCT_US:32113001, C0265660'],
        ['1', 'HP:0005709', "['HP:0001780']", 'Abnormality of toe', 'C2674738'],
        ['2', 'EFO:0009136', "['HP:0001507']", 'Growth abnormality', 'C0262361']]

df = pd.DataFrame(data,columns=columns)
print(df)
    col1    col2        col3            col4                col5
0   0       HP:0005709  ['HP:0001770']  Toe syndactyly      SNOMEDCT_US:32113001, C0265660
1   1       HP:0005709  ['HP:0001780']  Abnormality of toe  C2674738
2   2       EFO:0009136 ['HP:0001507']  Growth abnormality  C0262361
column = 'col5'
# Split on ', ' (comma plus space) so the second value has no leading space
df[column] = df[column].str.split(', ')
new_df = df.explode(column)
print(new_df)
    col1    col2        col3            col4                col5
0   0       HP:0005709  ['HP:0001770']  Toe syndactyly      SNOMEDCT_US:32113001
0   0       HP:0005709  ['HP:0001770']  Toe syndactyly      C0265660
1   1       HP:0005709  ['HP:0001780']  Abnormality of toe  C2674738
2   2       EFO:0009136 ['HP:0001507']  Growth abnormality  C0262361
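
If the spacing around the commas is not consistent across rows, splitting on a fixed separator can leave stray whitespace in the exploded values. One way around that is to strip after exploding. The sketch below is only an illustration: `explode_delimited` is a hypothetical helper name, not a pandas function, and the two-column frame is cut down from the question's data.

```python
import pandas as pd


def explode_delimited(frame, column, sep=","):
    """Hypothetical helper: split `column` on `sep`, explode, then strip spaces."""
    out = frame.assign(**{column: frame[column].str.split(sep)})
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out


demo = pd.DataFrame({
    "col3": ["Toe syndactyly", "Abnormality of toe"],
    "col4": ["SNOMEDCT_US:32113001, C0265660", "C2674738"],
})
result = explode_delimited(demo, "col4")
print(result)
```

Drop `ignore_index=True` if you want the repeated index labels shown in the expected output.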