I have a dataframe with hundreds of columns like the example below:
1 12 13 14 15
id=10 formatted_value=U$ 20.000 weighted_value=U$ 20000 person_name=Natys Person query={'id':0,'name':'Robert'}
id=11 formatted_value=U$ 10.000 weighted_value=U$ 10000 person_name=Mike Tyson query={'id':2,'name':'Roberta'}
id=12 formatted_value=U$ 18.000 weighted_value=U$ 10000 person_name=Mike Talbud sometext
I want to perform an operation to rename these columns with the string tha comes before the =
sign in each row and then delete the string=
All rows has this same pattern beginning with string=
The output should be:
id formatted_value weighted_value person_name query
10 U$ 20.000 U$ 20000 Natys Person {'id':0,'name':'Robert'}
11 U$ 10.000 U$ 10000 Mike Tyson {'id':0,'name':'Robert'}
12 U$ 18.000 U$ 10000 Mike Talbud sometext
Tried some approachs but I failed to do it.
CodePudding user response:
Using a stack
/extract
/unstack
vectorial approach:
out = (df
.stack().droplevel(1)
.str.extract('([^=] )=\s*(.*)')
.set_index(0, append=True)[1]
.unstack().rename_axis(columns=None)
)
output:
formatted_value id person_name query weighted_value
0 U$ 20.000 10 Natys Person {'id':0,'name':'Robert'} U$ 20000
1 U$ 10.000 11 Mike Tyson {'id':2,'name':'Roberta'} U$ 10000
handling duplicate index
we can reset the index and restore it later:
out = (df
.reset_index(drop=True)
.stack().droplevel(1)
.str.extract('([^=] )=\s*(.*)')
.set_index(0, append=True)[1]
.unstack().rename_axis(columns=None)
.set_axis(df.index)
)
CodePudding user response:
pd.DataFrame(df.apply(lambda x:dict(list(x.str.split('='))),axis=1).to_list())
id formatted_value weighted_value person_name
0 10 U$ 20.000 U$ 20000 Natys Person
1 11 U$ 10.000 U$ 10000 Mike Tyson
with the dictionaries:
pd.DataFrame(df.apply(lambda x:dict(list(x.str.split('='))),axis=1).to_list())
id formatted_value weighted_value person_name query
0 10 U$ 20.000 U$ 20000 Natys Person {'id':0,'name':'Robert'}
1 11 U$ 10.000 U$ 10000 Mike Tyson {'id':2,'name':'Roberta'}
where
df = pd.DataFrame({'1': {0: 'id=10', 1: 'id=11'},
'12': {0: 'formatted_value=U$ 20.000', 1: 'formatted_value=U$ 10.000'},
'13': {0: 'weighted_value=U$ 20000', 1: 'weighted_value=U$ 10000'},
'14': {0: 'person_name=Natys Person', 1: 'person_name=Mike Tyson'},
'15': {0: "query={'id':0,'name':'Robert'}",
1: "query={'id':2,'name':'Roberta'}"}})
EDIT:
pd.DataFrame(df.apply(lambda x:dict(list(x.str.replace('(^[^=] $)', 'one_text=\\1', regex=True).str.split('='))),axis=1).to_list())
id formatted_value ... query one_text
0 10 U$ 20.000 ... {'id':0,'name':'Robert'} NaN
1 11 U$ 10.000 ... {'id':2,'name':'Roberta'} NaN
2 12 U$ 18.000 ... NaN sometext
CodePudding user response:
You can try split the value into dict and convert each dictionary column to separate column
df = df.applymap(lambda x: dict([x.split('=')]))
out = pd.concat([df[col].apply(pd.Series) for col in df.columns], axis=1)
print(out)
id formatted_value weighted_value person_name
0 10 U$ 20.000 U$ 20000 Natys Person
1 11 U$ 10.000 U$ 10000 Mike Tyson
CodePudding user response:
I guess in some columns containing more than 1 =
, so try this solution
import pandas as pd
df = pd.DataFrame({'1': {0: 'id=10', 1: 'id=11'},
'12': {0: 'formatted_value=U$ 20.000', 1: 'formatted_value=U$ 10.000'},
'13': {0: 'weighted_value=U$ 20000', 1: 'weighted_value=U$ 10000'},
'14': {0: 'person_name=Natys Person', 1: 'person_name=Mike Tyson'},
'15': {0: "query={'id':0,'name':'Robert'}",
1: "query={'id':2,'name':'Roberta'}"}})
for column in df.columns:
sample_value = df.loc[0,column]
new_column_name = sample_value.split("=")[0]
df.rename(columns={column: new_column_name}, inplace=True)
df.loc[:, new_column_name] = df[new_column_name].apply(lambda x: "=".join(x.split("=")[1:]))
df