Given the following dataframe definition,
data = pd.DataFrame({"sec":["dim1", "dim2", "dim3"], # Construct a pandas DataFrame
"m1vc_15":[1,2,3],
"m1vu_15":[10,11,12],
"m2vc_15":[20,21,22],
"m2vu_15":[30,31,32],
"m1vc_16":[4,5,6],
"m1vu_16":[13,14,15],
"m2vc_16":[23,24,25],
"m2vu_16":[33,34,35]})
which generates this table structure:
sec m1vc_15 m1vu_15 m2vc_15 m2vu_15 m1vc_16 m1vu_16 m2vc_16 m2vu_16
dim1 1 10 20 30 4 13 23 33
dim2 2 11 21 31 5 14 24 34
dim3 3 12 22 32 6 15 25 35
Is there any function or package that allows me to create this expected outcome?
data2 = pd.DataFrame({"sec":np.repeat(["dim1", "dim2", "dim3", "dim1", "dim2", "dim3"],2), # Construct a pandas DataFrame
"type":np.repeat(['vc','vu'],6),
"year":["2015", "2016", "2015", "2016", "2015", "2016",
"2015", "2016", "2015", "2016", "2015", "2016"],
"m1":[1, 4, 2, 5, 3, 6, 10, 13, 11, 14, 12, 15],
"m2":[20, 23, 21, 24, 22, 25, 30, 33, 31, 34, 32, 35]})
"""
# Output
sec type year m1 m2
dim1 vc 2015 1 20
dim1 vc 2016 4 23
dim2 vc 2015 2 21
dim2 vc 2016 5 24
dim3 vc 2015 3 22
dim3 vc 2016 6 25
dim1 vu 2015 10 30
dim1 vu 2016 13 33
dim2 vu 2015 11 31
dim2 vu 2016 14 34
dim3 vu 2015 12 32
dim3 vu 2016 15 35
"""
The logic for the transpose is:
- The column names with metrics always end with the pattern
_integer
. The integer is the last two digits of the year. - The year of the metric needs to be transposed as a new column.
- The data source has more than 2 years, but the logic is the same.
- Before the year in the column's name, there is always the string
vu_
orvc_
. This string differentiates different types of objects. There will be no other value for type in the data source. - This string needs to be transposed as a new column.
- Once removed the type and year, the rest of the column name is the final metric name.
- The data source contains more metrics than the ones in this example.
CodePudding user response:
You can use wide_to_long
with a few more steps to clean up the columns:
df = pd.wide_to_long(data, stubnames=['m1', 'm2'], i=['sec'], j='type_year', suffix='\w ').reset_index()
df[['type', 'year']] = df.type_year.str.split('_', expand=True)
df.year = '20' df.year
df.drop('type_year', axis=1, inplace=True)
df
sec m1 m2 type year
0 dim1 1 20 vc 2015
1 dim2 2 21 vc 2015
2 dim3 3 22 vc 2015
3 dim1 10 30 vu 2015
4 dim2 11 31 vu 2015
5 dim3 12 32 vu 2015
6 dim1 4 23 vc 2016
7 dim2 5 24 vc 2016
8 dim3 6 25 vc 2016
9 dim1 13 33 vu 2016
10 dim2 14 34 vu 2016
11 dim3 15 35 vu 2016
CodePudding user response:
One efficient option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(data
.pivot_longer(
index = 'sec',
names_to = ('.value', 'type', 'year'),
names_pattern = r"(.\d)(. )_(\d )")
.assign(year = lambda df: ("20" df.year).astype(int))
)
sec type year m1 m2
0 dim1 vc 2015 1 20
1 dim2 vc 2015 2 21
2 dim3 vc 2015 3 22
3 dim1 vu 2015 10 30
4 dim2 vu 2015 11 31
5 dim3 vu 2015 12 32
6 dim1 vc 2016 4 23
7 dim2 vc 2016 5 24
8 dim3 vc 2016 6 25
9 dim1 vu 2016 13 33
10 dim2 vu 2016 14 34
11 dim3 vu 2016 15 35
The .value
determines which columns remain as header, which in this case is m1/m2
; the rest are lumped into the type
and year
columns
A faster option would be to convert the year
column into an integer within pivot_longer
, by using the names_transform
option:
(data
.pivot_longer(
index = 'sec',
names_to = ('.value', 'type', 'year'),
names_pattern = r"(.\d)(. )_(\d )",
names_transform = {'year': int})
.assign(year = lambda df: df.year 2000)
)
sec type year m1 m2
0 dim1 vc 2015 1 20
1 dim2 vc 2015 2 21
2 dim3 vc 2015 3 22
3 dim1 vu 2015 10 30
4 dim2 vu 2015 11 31
5 dim3 vu 2015 12 32
6 dim1 vc 2016 4 23
7 dim2 vc 2016 5 24
8 dim3 vc 2016 6 25
9 dim1 vu 2016 13 33
10 dim2 vu 2016 14 34
11 dim3 vu 2016 15 35