How to create new columns based on values before delimiter and add values after the delimiter in Pan-CodePudding

I am reading a .txt in Pandas. I want to create new column names based on values before '=>' and add values in that column, based on value after '=>'

Input:

other_tags
"access"=>"agricultural","agricultural"=>"no"
"access"=>"customers"
"access"=>"customers","amenity"=>"parking"
"description"=>"GRAND PERE DE MON AMI"
"description"=>"GRAND PERE"
"design"=>"asymmetric","power"=>"tower"
"design"=>"asymmetric","power"=>"tower"

code:

import pandas as pd
df1 = pd.read_csv('try.txt', sep ='\t')
df1['access_type'] = df1['other_tags'].get('access')
df1['amenity'] = df1['other_tags'].get('amenity')
print(df1)

expected_output:

other_tags                                         access_type                            amenity
"access"=>"agricultural","agricultural"=>"no"      agricultural 
"access"=>"customers"                              customers    
"access"=>"customers","amenity"=>"parking"         customers                              parking
"description"=>"GRAND PERE DE MON AMI"      
"description"=>"GRAND PERE"     
"design"=>"asymmetric","power"=>"tower"     
"design"=>"asymmetric","power"=>"tower"

CodePudding user response：

Explode and expand your column other_tags then pivot it and finally join it to your original dataframe.

df = df.join(df['other_tags'].str.replace('"', '')
                 .str.split(',').explode().str.split('=>', expand=True)
                 .reset_index().pivot('index', 0, 1).fillna(''))

Output

>>> df
                                      other_tags        access agricultural  amenity            description      design  power
0  "access"=>"agricultural","agricultural"=>"no"  agricultural           no                                                   
1                          "access"=>"customers"     customers                                                                
2     "access"=>"customers","amenity"=>"parking"     customers               parking                                          
3         "description"=>"GRAND PERE DE MON AMI"                                      GRAND PERE DE MON AMI                   
4                    "description"=>"GRAND PERE"                                                 GRAND PERE                   
5        "design"=>"asymmetric","power"=>"tower"                                                             asymmetric  tower
6        "design"=>"asymmetric","power"=>"tower"                                                             asymmetric  tower

If you want to keep only the 2 columns access and amenity:

COLS_TO_KEEP = ['access', 'amenity']

df = df.join(df['other_tags'].str.replace('"', '')
                 .str.split(',').explode().str.split('=>', expand=True)
                 .reset_index().pivot('index', 0, 1)[COLS_TO_KEEP].fillna(''))

Output:

>>> df
                                      other_tags        access  amenity
0  "access"=>"agricultural","agricultural"=>"no"  agricultural         
1                          "access"=>"customers"     customers         
2     "access"=>"customers","amenity"=>"parking"     customers  parking
3         "description"=>"GRAND PERE DE MON AMI"                       
4                    "description"=>"GRAND PERE"                       
5        "design"=>"asymmetric","power"=>"tower"                       
6        "design"=>"asymmetric","power"=>"tower"