Home > OS >  Python - pandas: create new columns and transpose depending on column names
Python - pandas: create new columns and transpose depending on column names

Time:09-12

Given the following dataframe definition,

data = pd.DataFrame({"sec":["dim1", "dim2", "dim3"],  # Construct a pandas DataFrame
                     "m1vc_15":[1,2,3],
                     "m1vu_15":[10,11,12],
                     "m2vc_15":[20,21,22],
                     "m2vu_15":[30,31,32],
                     "m1vc_16":[4,5,6],
                     "m1vu_16":[13,14,15],
                     "m2vc_16":[23,24,25],
                     "m2vu_16":[33,34,35]})

which generates this table structure:

sec  m1vc_15  m1vu_15  m2vc_15  m2vu_15  m1vc_16  m1vu_16  m2vc_16  m2vu_16
dim1        1       10       20       30        4       13       23       33
dim2        2       11       21       31        5       14       24       34
dim3        3       12       22       32        6       15       25       35

Is there any function or package that allows me to create this expected outcome?

data2 = pd.DataFrame({"sec":np.repeat(["dim1", "dim2", "dim3", "dim1", "dim2", "dim3"],2),  # Construct a pandas DataFrame
                     "type":np.repeat(['vc','vu'],6),
                     "year":["2015", "2016", "2015", "2016", "2015", "2016",
                             "2015", "2016", "2015", "2016", "2015", "2016"],
                     "m1":[1, 4, 2, 5, 3, 6, 10, 13, 11, 14, 12, 15],
                     "m2":[20, 23, 21, 24, 22, 25, 30, 33, 31, 34, 32, 35]})
""" 
# Output
sec type year  m1  m2
dim1   vc 2015   1  20
dim1   vc 2016   4  23
dim2   vc 2015   2  21
dim2   vc 2016   5  24
dim3   vc 2015   3  22
dim3   vc 2016   6  25
dim1   vu 2015  10  30
dim1   vu 2016  13  33
dim2   vu 2015  11  31
dim2   vu 2016  14  34
dim3   vu 2015  12  32
dim3   vu 2016  15  35
"""

The logic for the transpose is:

  • The column names with metrics always end with the pattern _integer. The integer is the last two digits of the year.
  • The year of the metric needs to be transposed as a new column.
  • The data source has more than 2 years, but the logic is the same.
  • Before the year in the column's name, there is always the string vu_ or vc_. This string differentiates different types of objects. There will be no other value for type in the data source.
  • This string needs to be transposed as a new column.
  • Once removed the type and year, the rest of the column name is the final metric name.
  • The data source contains more metrics than the ones in this example.

CodePudding user response:

You can use wide_to_long with a few more steps to clean up the columns:

df = pd.wide_to_long(data, stubnames=['m1', 'm2'], i=['sec'], j='type_year', suffix='\w ').reset_index()
df[['type', 'year']] = df.type_year.str.split('_', expand=True)
df.year = '20'   df.year
df.drop('type_year', axis=1, inplace=True)

df
     sec  m1  m2 type  year
0   dim1   1  20   vc  2015
1   dim2   2  21   vc  2015
2   dim3   3  22   vc  2015
3   dim1  10  30   vu  2015
4   dim2  11  31   vu  2015
5   dim3  12  32   vu  2015
6   dim1   4  23   vc  2016
7   dim2   5  24   vc  2016
8   dim3   6  25   vc  2016
9   dim1  13  33   vu  2016
10  dim2  14  34   vu  2016
11  dim3  15  35   vu  2016

CodePudding user response:

One efficient option is with pivot_longer from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

(data
.pivot_longer(
    index = 'sec', 
    names_to = ('.value', 'type', 'year'), 
    names_pattern = r"(.\d)(. )_(\d )")
.assign(year = lambda df: ("20" df.year).astype(int))
)
     sec type  year  m1  m2
0   dim1   vc  2015   1  20
1   dim2   vc  2015   2  21
2   dim3   vc  2015   3  22
3   dim1   vu  2015  10  30
4   dim2   vu  2015  11  31
5   dim3   vu  2015  12  32
6   dim1   vc  2016   4  23
7   dim2   vc  2016   5  24
8   dim3   vc  2016   6  25
9   dim1   vu  2016  13  33
10  dim2   vu  2016  14  34
11  dim3   vu  2016  15  35

The .value determines which columns remain as header, which in this case is m1/m2; the rest are lumped into the type and year columns

A faster option would be to convert the year column into an integer within pivot_longer, by using the names_transform option:

(data
.pivot_longer(
    index = 'sec', 
    names_to = ('.value', 'type', 'year'), 
    names_pattern = r"(.\d)(. )_(\d )",
    names_transform = {'year': int})
.assign(year = lambda df: df.year   2000)
)

     sec type  year  m1  m2
0   dim1   vc  2015   1  20
1   dim2   vc  2015   2  21
2   dim3   vc  2015   3  22
3   dim1   vu  2015  10  30
4   dim2   vu  2015  11  31
5   dim3   vu  2015  12  32
6   dim1   vc  2016   4  23
7   dim2   vc  2016   5  24
8   dim3   vc  2016   6  25
9   dim1   vu  2016  13  33
10  dim2   vu  2016  14  34
11  dim3   vu  2016  15  35
  • Related