I'm working on a legacy database with a table that looks like this:
Account | Key1 | Key2 | Key3 | Val1 | Val2 | Val3 |
---|---|---|---|---|---|---|
1 | Home | Work | 555-555-1111 | 555-555-2222 | ||
2 | Home | 555-555-3333 | ||||
3 | Mobile | Work | Home | 555-555-4444 | 555-555-5555 | 555-555-6666 |
I'd like to transform that into a properly normalized form in Pandas. The desired output looks like this:
Account | PhoneType | PhoneNumber |
---|---|---|
1 | Home | 555-555-1111 |
1 | Work | 555-555-2222 |
2 | Home | 555-555-3333 |
3 | Mobile | 555-555-4444 |
3 | Work | 555-555-5555 |
3 | Home | 555-555-6666 |
The following code will create a dataframe to start with:
import pandas as pd
df = pd.DataFrame([
{
"Account": "1",
"Key1": "Home",
"Key2": "Work",
"Val1": "555-555-1111",
"Val2": "555-555-2222"
},
{
"Account": "2",
"Key1": "Home",
"Val1": "555-555-3333"
},
{
"Account": "3",
"Key1": "Mobile",
"Key2": "Work",
"Key3": "Home",
"Val1": "555-555-4444",
"Val2": "555-555-5555",
"Val3": "555-555-6666"
}
])
What's the cleanest / most efficient way to transform the dataframe as indicated above?
CodePudding user response:
You could use wide-to-long for this
(
pd.wide_to_long(df,
['Key','Val'],
i='Account', j='n')
.dropna()
.reset_index()
.drop(columns='n')
.sort_values(by='Account')
)
Output
Account Key Val
0 1 Home 555-555-1111
3 1 Work 555-555-2222
1 2 Home 555-555-3333
2 3 Mobile 555-555-4444
4 3 Work 555-555-5555
5 3 Home 555-555-6666
CodePudding user response:
Personally, I would do the transformation before moving to pandas. Many transforms like this seem more straightforward outside of pandas.
import pandas as pd
data = [
{
"Account": "1",
"Key1": "Home",
"Key2": "Work",
"Val1": "555-555-1111",
"Val2": "555-555-2222"
},
{
"Account": "2",
"Key1": "Home",
"Val1": "555-555-3333"
},
{
"Account": "3",
"Key1": "Mobile",
"Key2": "Work",
"Key3": "Home",
"Val1": "555-555-4444",
"Val2": "555-555-5555",
"Val3": "555-555-6666"
}
]
rows = []
for row in data:
for n in "123":
if f"Key{n}" in row:
rows.append((row['Account'],row[f'Key{n}'],row[f'Val{n}']))
df = pd.DataFrame(rows, columns=["Account","PhoneType","PhoneNumber"])
print(df)
Output:
Account PhoneType PhoneNumber
0 1 Home 555-555-1111
1 1 Work 555-555-2222
2 2 Home 555-555-3333
3 3 Mobile 555-555-4444
4 3 Work 555-555-5555
5 3 Home 555-555-6666
CodePudding user response:
try:
df
Account Key1 Key2 Val1 Val2 Key3 Val3
0 1 Home Work 555-555-1111 555-555-2222 NaN NaN
1 2 Home NaN 555-555-3333 NaN NaN NaN
2 3 Mobile Work 555-555-4444 555-555-5555 Home 555-555-6666
df['home_phone'] = np.select([df['Key1'].eq('Home'),df['Key2'].eq('Home'),df['Key3'].eq('Home')], [df['Val1'], df['Val2'], df['Val3']], None)
df['work_phone'] = np.select([df['Key1'].eq('Work'),df['Key2'].eq('Work'),df['Key3'].eq('Work')], [df['Val1'], df['Val2'], df['Val3']], None)
df['mobile_phone'] = np.select([df['Key1'].eq('Mobile'),df['Key2'].eq('Mobile'),df['Key3'].eq('Mobile')], [df['Val1'], df['Val2'], df['Val3']], None)
df = df.drop(columns=['Key1', 'Key2', 'Val1', 'Val2', 'Key3', 'Val3',])
df
Account home_phone work_phone mobile_phone
0 1 555-555-1111 555-555-2222 None
1 2 555-555-3333 None None
2 3 555-555-6666 555-555-5555 555-555-4444