Pandas Python Copy row based column value-CodePudding

I am working with data from a csv file. I've read in the dataframe using a pd.read.csv. Where the entry has a value in the 'Mobile_phone' column I would then like to duplicate the row and put the 'Mobile_phone' value in the 'Work_phone' column.

This is the data I'm starting with -

    Full name         Work_phone   Mobile_phone  Company
1   Amanda Brown      1234567896   77895641225   A company
2   Bert Sutherland   1234567897                 B company
3   Charlie Chaplin   1234567898                 C company
4   Derek Simpson     1234567899   77895641228   D company

This is the data I would like to return. Therefore removing the need for the 'Mobile_phone' data so I can marge with another dataset -

    Full name         Work_phone   Mobile_phone  Company
1   Amanda Brown      1234567896                 A company
2   Amanda Brown      77895641225                A company
3   Bert Sutherland   1234567897                 B company
4   Charlie Chaplin   1234567898                 C company
5   Derek Simpson     1234567899                 D company
6   Derek Simpson     77895641228                D company

CodePudding user response：

We can use set_index stack to do the reshaping from wide to long format. Then cleanup by droplevel the old column headers, reset_index to restore the RangeIndex and make a DataFrame again, then reorder the columns:

new_df = (
    df.set_index(['Full name', 'Company'])  # Columns to save
        .stack()  # go to long format
        .droplevel(-1)  # remove old column headers
        .reset_index(name='Work_phone')  # Restore Index and name new column
    [['Full name', 'Work_phone', 'Company']]  # re-order columns
)

new_df:

         Full name   Work_phone    Company
0     Amanda Brown   1234567896  A company
1     Amanda Brown  77895641225  A company
2  Bert Sutherland   1234567897  B company
3  Charlie Chaplin   1234567898  C company
4    Derek Simpson   1234567899  D company
5    Derek Simpson  77895641228  D company

Additionally, if needed, we can reindex instead of selecting columns to add back the Mobile_phone column:

new_df = (
    df.set_index(['Full name', 'Company'])  # Columns to save
        .stack()  # go to long format
        .droplevel(-1)  # remove old column headers
        .reset_index(name='Work_phone')  # Restore Index and name new column
        .reindex(
            # re-order columns and add missing columns
            columns=['Full name', 'Work_phone', 'Mobile_phone', 'Company']
        )
)

new_df:

         Full name   Work_phone  Mobile_phone    Company
0     Amanda Brown   1234567896           NaN  A company
1     Amanda Brown  77895641225           NaN  A company
2  Bert Sutherland   1234567897           NaN  B company
3  Charlie Chaplin   1234567898           NaN  C company
4    Derek Simpson   1234567899           NaN  D company
5    Derek Simpson  77895641228           NaN  D company

Setup used:

import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'Full name': ['Amanda Brown', 'Bert Sutherland', 'Charlie Chaplin',
                  'Derek Simpson'],
    'Work_phone': [1234567896, 1234567897, 1234567898, 1234567899],
    'Mobile_phone': ['77895641225', nan, nan, '77895641228'],
    'Company': ['A company', 'B company', 'C company', 'D company']
})

Note: if Mobile_phone contains empty strings ('') instead of NaN may need to remove those first with mask otherwise stack will not automatically drop the unneeded rows:

df['Mobile_phone'] = df['Mobile_phone'].mask(df['Mobile_phone'].eq(''))

CodePudding user response：

TLDR

work_phone_df = df.drop("Mobile_phone", axis=1)
mobile_phone_df = df.drop("Work_phone", axis=1).dropna(subset=["Mobile_phone"]).rename(columns={"Mobile_phone": "Work_phone"})
new_df = pd.concat([work_phone_df, mobile_phone_df])
# if you need to sort your data and fix the index
new_df = new_df.sort_values(["Full name"]).reset_index(drop=True)

Each step explained:

First, you can get a copy of the dataframe with everyone's name, company, and work phone.

work_phone_df = df.drop("Mobile_phone", axis=1)

         Full name  Work_phone    Company
0     Amanda Brown  1234567896  A company
1  Bert Sutherland  1234567897  B company
2  Charlie Chaplin  1234567898  C company
3    Derek Simpson  1234567899  D company

Then, get a copy of the dataframe with everyone who has a mobile phone, but with the "Mobile_phone" column renamed to "Work_phone".

mobile_phone_df = df.drop("Work_phone", axis=1).dropna(subset=["Mobile_phone"]).rename(columns={"Mobile_phone": "Work_phone"})

       Full name   Work_phone    Company
0   Amanda Brown  77895641225  A company
3  Derek Simpson  77895641228  D company

Now, you can concatenate them together.

new_df = pd.concat([work_phone_df, mobile_phone_df])

         Full name   Work_phone    Company
0     Amanda Brown   1234567896  A company
1  Bert Sutherland   1234567897  B company
2  Charlie Chaplin   1234567898  C company
3    Derek Simpson   1234567899  D company
0     Amanda Brown  77895641225  A company
3    Derek Simpson  77895641228  D company

I'm not sure if you need this result to be sorted, but you can sort the dataframe using

new_df = new_df.sort_values(["Full name"])

         Full name   Work_phone    Company
0     Amanda Brown   1234567896  A company
0     Amanda Brown  77895641225  A company
1  Bert Sutherland   1234567897  B company
2  Charlie Chaplin   1234567898  C company
3    Derek Simpson   1234567899  D company
3    Derek Simpson  77895641228  D company

If you need to renumber the index, you can do something like

new_df = new_df.reset_index(drop=True)

         Full name   Work_phone    Company
0     Amanda Brown   1234567896  A company
1     Amanda Brown  77895641225  A company
2  Bert Sutherland   1234567897  B company
3  Charlie Chaplin   1234567898  C company
4    Derek Simpson   1234567899  D company
5    Derek Simpson  77895641228  D company