I am working with data from a csv file. I've read in the dataframe using a pd.read.csv. Where the entry has a value in the 'Mobile_phone' column I would then like to duplicate the row and put the 'Mobile_phone' value in the 'Work_phone' column.
This is the data I'm starting with -
Full name Work_phone Mobile_phone Company
1 Amanda Brown 1234567896 77895641225 A company
2 Bert Sutherland 1234567897 B company
3 Charlie Chaplin 1234567898 C company
4 Derek Simpson 1234567899 77895641228 D company
This is the data I would like to return. Therefore removing the need for the 'Mobile_phone' data so I can marge with another dataset -
Full name Work_phone Mobile_phone Company
1 Amanda Brown 1234567896 A company
2 Amanda Brown 77895641225 A company
3 Bert Sutherland 1234567897 B company
4 Charlie Chaplin 1234567898 C company
5 Derek Simpson 1234567899 D company
6 Derek Simpson 77895641228 D company
CodePudding user response:
We can use set_index
stack
to do the reshaping from wide to long format. Then cleanup by droplevel
the old column headers, reset_index
to restore the RangeIndex and make a DataFrame again, then reorder the columns:
new_df = (
df.set_index(['Full name', 'Company']) # Columns to save
.stack() # go to long format
.droplevel(-1) # remove old column headers
.reset_index(name='Work_phone') # Restore Index and name new column
[['Full name', 'Work_phone', 'Company']] # re-order columns
)
new_df
:
Full name Work_phone Company
0 Amanda Brown 1234567896 A company
1 Amanda Brown 77895641225 A company
2 Bert Sutherland 1234567897 B company
3 Charlie Chaplin 1234567898 C company
4 Derek Simpson 1234567899 D company
5 Derek Simpson 77895641228 D company
Additionally, if needed, we can reindex
instead of selecting columns to add back the Mobile_phone
column:
new_df = (
df.set_index(['Full name', 'Company']) # Columns to save
.stack() # go to long format
.droplevel(-1) # remove old column headers
.reset_index(name='Work_phone') # Restore Index and name new column
.reindex(
# re-order columns and add missing columns
columns=['Full name', 'Work_phone', 'Mobile_phone', 'Company']
)
)
new_df
:
Full name Work_phone Mobile_phone Company
0 Amanda Brown 1234567896 NaN A company
1 Amanda Brown 77895641225 NaN A company
2 Bert Sutherland 1234567897 NaN B company
3 Charlie Chaplin 1234567898 NaN C company
4 Derek Simpson 1234567899 NaN D company
5 Derek Simpson 77895641228 NaN D company
Setup used:
import pandas as pd
from numpy import nan
df = pd.DataFrame({
'Full name': ['Amanda Brown', 'Bert Sutherland', 'Charlie Chaplin',
'Derek Simpson'],
'Work_phone': [1234567896, 1234567897, 1234567898, 1234567899],
'Mobile_phone': ['77895641225', nan, nan, '77895641228'],
'Company': ['A company', 'B company', 'C company', 'D company']
})
Note: if Mobile_phone contains empty strings (''
) instead of NaN
may need to remove those first with mask
otherwise stack
will not automatically drop the unneeded rows:
df['Mobile_phone'] = df['Mobile_phone'].mask(df['Mobile_phone'].eq(''))
CodePudding user response:
TLDR
work_phone_df = df.drop("Mobile_phone", axis=1)
mobile_phone_df = df.drop("Work_phone", axis=1).dropna(subset=["Mobile_phone"]).rename(columns={"Mobile_phone": "Work_phone"})
new_df = pd.concat([work_phone_df, mobile_phone_df])
# if you need to sort your data and fix the index
new_df = new_df.sort_values(["Full name"]).reset_index(drop=True)
Each step explained:
First, you can get a copy of the dataframe with everyone's name, company, and work phone.
work_phone_df = df.drop("Mobile_phone", axis=1)
Full name Work_phone Company
0 Amanda Brown 1234567896 A company
1 Bert Sutherland 1234567897 B company
2 Charlie Chaplin 1234567898 C company
3 Derek Simpson 1234567899 D company
Then, get a copy of the dataframe with everyone who has a mobile phone, but with the "Mobile_phone"
column renamed to "Work_phone"
.
mobile_phone_df = df.drop("Work_phone", axis=1).dropna(subset=["Mobile_phone"]).rename(columns={"Mobile_phone": "Work_phone"})
Full name Work_phone Company
0 Amanda Brown 77895641225 A company
3 Derek Simpson 77895641228 D company
Now, you can concatenate them together.
new_df = pd.concat([work_phone_df, mobile_phone_df])
Full name Work_phone Company
0 Amanda Brown 1234567896 A company
1 Bert Sutherland 1234567897 B company
2 Charlie Chaplin 1234567898 C company
3 Derek Simpson 1234567899 D company
0 Amanda Brown 77895641225 A company
3 Derek Simpson 77895641228 D company
I'm not sure if you need this result to be sorted, but you can sort the dataframe using
new_df = new_df.sort_values(["Full name"])
Full name Work_phone Company
0 Amanda Brown 1234567896 A company
0 Amanda Brown 77895641225 A company
1 Bert Sutherland 1234567897 B company
2 Charlie Chaplin 1234567898 C company
3 Derek Simpson 1234567899 D company
3 Derek Simpson 77895641228 D company
If you need to renumber the index, you can do something like
new_df = new_df.reset_index(drop=True)
Full name Work_phone Company
0 Amanda Brown 1234567896 A company
1 Amanda Brown 77895641225 A company
2 Bert Sutherland 1234567897 B company
3 Charlie Chaplin 1234567898 C company
4 Derek Simpson 1234567899 D company
5 Derek Simpson 77895641228 D company