Home > database >  Renaming columns according to a string in the rows of a dataframe
Renaming columns according to a string in the rows of a dataframe

Time:06-06

I have a dataframe with hundreds of columns like the example below:

  1              12                             13                            14                  15
               
id=10    formatted_value=U$ 20.000    weighted_value=U$ 20000    person_name=Natys Person    query={'id':0,'name':'Robert'}
id=11    formatted_value=U$ 10.000    weighted_value=U$ 10000    person_name=Mike Tyson      query={'id':2,'name':'Roberta'}
id=12    formatted_value=U$ 18.000    weighted_value=U$ 10000    person_name=Mike Talbud     sometext

I want to perform an operation to rename these columns with the string tha comes before the = sign in each row and then delete the string=

All rows has this same pattern beginning with string=

The output should be:

   id          formatted_value    weighted_value    person_name     query                          
    10            U$ 20.000          U$ 20000         Natys Person  {'id':0,'name':'Robert'}    
    11            U$ 10.000          U$ 10000         Mike Tyson    {'id':0,'name':'Robert'}
    12            U$ 18.000          U$ 10000         Mike Talbud   sometext

Tried some approachs but I failed to do it.

CodePudding user response:

Using a stack/extract/unstack vectorial approach:

out = (df
  .stack().droplevel(1)
  .str.extract('([^=] )=\s*(.*)')
  .set_index(0, append=True)[1]
  .unstack().rename_axis(columns=None)
)

output:

  formatted_value  id   person_name                      query weighted_value
0       U$ 20.000  10  Natys Person   {'id':0,'name':'Robert'}       U$ 20000
1       U$ 10.000  11    Mike Tyson  {'id':2,'name':'Roberta'}       U$ 10000

handling duplicate index

we can reset the index and restore it later:

out = (df
  .reset_index(drop=True)
  .stack().droplevel(1)
  .str.extract('([^=] )=\s*(.*)')
  .set_index(0, append=True)[1]
  .unstack().rename_axis(columns=None)
  .set_axis(df.index)
)

CodePudding user response:

pd.DataFrame(df.apply(lambda x:dict(list(x.str.split('='))),axis=1).to_list())

   id formatted_value weighted_value   person_name
0  10       U$ 20.000       U$ 20000  Natys Person
1  11       U$ 10.000       U$ 10000    Mike Tyson

with the dictionaries:

pd.DataFrame(df.apply(lambda x:dict(list(x.str.split('='))),axis=1).to_list())

   id formatted_value weighted_value   person_name                      query
0  10       U$ 20.000       U$ 20000  Natys Person   {'id':0,'name':'Robert'}
1  11       U$ 10.000       U$ 10000    Mike Tyson  {'id':2,'name':'Roberta'}

where

df = pd.DataFrame({'1': {0: 'id=10', 1: 'id=11'},
 '12': {0: 'formatted_value=U$ 20.000', 1: 'formatted_value=U$ 10.000'},
 '13': {0: 'weighted_value=U$ 20000', 1: 'weighted_value=U$ 10000'},
 '14': {0: 'person_name=Natys Person', 1: 'person_name=Mike Tyson'},
 '15': {0: "query={'id':0,'name':'Robert'}",
  1: "query={'id':2,'name':'Roberta'}"}})

EDIT:

pd.DataFrame(df.apply(lambda x:dict(list(x.str.replace('(^[^=] $)', 'one_text=\\1', regex=True).str.split('='))),axis=1).to_list())
 
   id formatted_value  ...                      query  one_text
0  10       U$ 20.000  ...   {'id':0,'name':'Robert'}       NaN
1  11       U$ 10.000  ...  {'id':2,'name':'Roberta'}       NaN
2  12       U$ 18.000  ...                        NaN  sometext

CodePudding user response:

You can try split the value into dict and convert each dictionary column to separate column

df = df.applymap(lambda x: dict([x.split('=')]))
out = pd.concat([df[col].apply(pd.Series) for col in df.columns], axis=1)
print(out)

   id formatted_value weighted_value   person_name
0  10       U$ 20.000       U$ 20000  Natys Person
1  11       U$ 10.000       U$ 10000    Mike Tyson

CodePudding user response:

I guess in some columns containing more than 1 =, so try this solution

import pandas as pd
df = pd.DataFrame({'1': {0: 'id=10', 1: 'id=11'},
 '12': {0: 'formatted_value=U$ 20.000', 1: 'formatted_value=U$ 10.000'},
 '13': {0: 'weighted_value=U$ 20000', 1: 'weighted_value=U$ 10000'},
 '14': {0: 'person_name=Natys Person', 1: 'person_name=Mike Tyson'},
 '15': {0: "query={'id':0,'name':'Robert'}",
  1: "query={'id':2,'name':'Roberta'}"}})

for column in df.columns:
    sample_value = df.loc[0,column]
    new_column_name = sample_value.split("=")[0]
    df.rename(columns={column: new_column_name}, inplace=True)
    df.loc[:, new_column_name] = df[new_column_name].apply(lambda x: "=".join(x.split("=")[1:]))
df
  • Related