Home > Software design >  reshaping the dataset in python
reshaping the dataset in python

Time:05-13

I have this dataset:

Account lookup FY11USD FY12USD FY11local FY12local
Sales CA 1000 5000 800 4800
Sales JP 5000 6500 10 15

Trying to arrive to get the data in this format: (below example has 2 years of data but no. of years can vary)

Account lookup Year USD Local
Sales CA FY11 1000 800
Sales CA FY12 5000 4800
Sales JP FY11 5000 10
Sales JP FY12 6500 15

I tried using the below script, but it doesn't segregate USD and local for the same year. How should I go about that?

df.melt(id_vars=["Account", "lookup"], 
    var_name="Year", 
    value_name="Value")

CodePudding user response:

You can piece it together like so:

dfn = (pd.concat( 
[df[["Account", "lookup", 'FY11USD','FY12USD']].melt(id_vars=["Account", "lookup"], var_name="Year", value_name="USD"),
df[["Account", "lookup", 'FY11local','FY12local']].melt(id_vars=["Account", "lookup"], var_name="Year", value_name="Local")[['Local']]], axis=1 ))
dfn['Year'] = dfn['Year'].str[:4]

Output

  Account lookup  Year   USD  Local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15

CodePudding user response:

One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder ---> the .value determines which parts of the columns remain as headers:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
     index = ['Account', 'lookup'], 
     names_to = ('Year', '.value'), 
     names_pattern = r"(FY\d )(. )")

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15

Another option is to use stack:

temp = df.set_index(['Account', 'lookup'])
temp.columns = temp.columns.str.split('(FY\d )', expand = True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15

You can also pull it off with pd.wide_to_long after reshaping the columns:

index = ['Account', 'lookup']
temp = df.set_index(index)
temp.columns = (temp
                .columns
                .str.split('(FY\d )')
                .str[::-1]
                .str.join('')
               )
(pd.wide_to_long(
      temp.reset_index(), 
      stubnames = ['USD', 'local'], 
      i = index, 
      j = 'Year', 
      suffix = '. ')
.reset_index()
)

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15
  • Related