I have a single row data-frame like below
Num TP1(USD) TP2(USD) TP3(USD) VReal1(USD) VReal2(USD) VReal3(USD) TiV1 (EUR) TiV2 (EUR) TiV3 (EUR) TR TR-Tag
AA-24 0 700 2100 300 1159 2877 30 30 47 10 5
I want to get a dataframe like the one below
ID Price Net Range
1 0 300 30
2 700 1159 30
3 2100 2877 47
The logic here is that a. there will be 3 columns names that contain TP/VR/TV. So in the ID, we have 1, 2 & 3 (these can be generated by extracting the value from the column names or just by using a range to fill) b. TP1 value goes into first row of column 'Price',TP2 value goes into second row of column 'Price' & so on c. Same for VR & TV. The values go into 'Net' & 'Range columns d. Columns 'Num', 'TR' & 'TR=Tag' are not relevant for the result.
I tried df.filter(regex='TP').stack()
. I get all the 'TP' column & I can access individual values be index ([0],[1],[2]). I could not get all of them into a column directly.
I also wondered if there may be a easier way of doing this.
CodePudding user response:
Assuming 'Num'
is a unique identifier, you can use pandas.wide_to_long
:
pd.wide_to_long(df, stubnames=['TP', 'VR', 'TV'], i='Num', j='ID')
or, for an output closer to yours:
out = (pd
.wide_to_long(df, stubnames=['TP', 'VR', 'TV'], i='Num', j='ID')
.reset_index('ID')
.drop(columns=['TR', 'TR-Tag'])
.rename(columns={'TP': 'Price', 'VR': 'Net', 'TV': 'Range'})
)
output:
ID Price Net Range
Num
AA-24 1 0 300 30
AA-24 2 700 1159 30
AA-24 3 2100 2877 47
updated answer
out = (pd
.wide_to_long(df.set_axis(df.columns.str.replace(r'\(USD\)$', '', regex=True),
axis=1),
stubnames=['TP', 'VReal', 'TiV'], i='Num', j='ID')
.reset_index('ID')
.drop(columns=['TR', 'TR-Tag'])
.rename(columns={'TP': 'Price', 'VReal': 'Net', 'TiV': 'Range'})
)
output:
ID Price Net Range
Num
AA-24 1 0 300 30
AA-24 2 700 1159 30
AA-24 3 2100 2877 47
CodePudding user response:
let us create a Multiindex
then use .stack
df1 = df.filter(regex='TP|VR|TV')
#i couldn't figure out to split by
#word\number without creating an additional whitespace split.
df1.columns = df1.columns\
.str.replace('(\d )', r' \1' ,regex=True).str.split(' ',expand=True)
#or more succinctly.
df1.columns = pd.MultiIndex.from_frame(df1.columns.str.extract('(\D )(\d )'))
print(df1)
TP VR TV
1 2 3 1 2 3 1 2 3
0 0 700 2100 300 1159 2877 30 30 47
df1.stack(1).rename(columns={'TP': 'Price', 'VR': 'Net', 'TV': 'Range'})
Price Range Net
0 1 0 30 300
2 700 30 1159
3 2100 47 2877
CodePudding user response:
pivot_wider
(see mozway's answer) is probably best here from a pure pandas perspective, but if you need more flexibility, you could also melt
and pivot
:
import pandas as pd
# recreating your dataframe
df = pd.DataFrame(['AA-24', '0', '700', '2100', '300', '1159', '2877', '30', '30', '47', '10', '5'],
index= ['Num', 'TP1(USD)', 'TP2(USD)', 'TP3(USD)', 'VReal1(USD)', 'VReal2(USD)', 'VReal3(USD)', 'TiV1(EUR)', 'TiV2(EUR)', 'TiV3(EUR)', 'TR', 'TR-Tag']).T
# reshaping the data
(df.melt(id_vars=['Num','TR', 'TR-Tag'])
.assign(col=lambda x: x['variable'].str[:2], idx=lambda x: x['variable'].str.extract("([0-9])"))
.pivot(values='value', columns='col', index='idx')
.rename(columns={'TP': 'Price', 'VR': 'Net', 'Ti': 'Range'})
)
Perhaps surprisingly, this is also faster than wide_to_long
. Benchmarking gives 7.76 ms ± 841 µs per loop for this method.
The wide_to_long
approach from mozway:
(pd
.wide_to_long(df.set_axis(df.columns.str.replace(r'\([A-Z]{3}\)$', '', regex=True),
axis=1),
stubnames=['TP', 'VReal', 'TiV'], i='Num', j='ID')
.reset_index('ID')
.drop(columns=['TR', 'TR-Tag'])
.rename(columns={'TP': 'Price', 'VReal': 'Net', 'TiV': 'Range'})
)
benchmarks at 30.4 ms ± 3.07 ms per loop on my machine.
Umar.H's answer using stack
is the faster than both:
df1 = df.filter(regex='TP|VR|TV')
df1.columns = df1.columns\
.str.replace('(\d )', r' \1' ,regex=True).str.split(' ',expand=True)
df1.stack(1).rename(columns={'TP': 'Price', 'VR': 'Net', 'TV': 'Range'})
Runs at 6.07 ms ± 156 µs per loop
If you don't mind the additional import, sammywemmy's answer using pyjanitor's pivot_wider
offers speed and an elegant syntax.
(df
.select_columns('TP*', 'VR*', 'Ti*')
.pivot_longer(index = None,
names_to = ('.value', 'ID'),
names_pattern = ('(. )(\d). '))
.rename(columns = {'TP':'Price', 'VReal':'Net', 'TiV':'Range'})
)
benchmarks at 11.2 ms ± 229 µs per loop
and the names pattern approach:
df.pivot_longer(index = None,
names_to = ('Price', 'Net', 'Range'),
names_pattern = ('TP.*', 'VR.*', 'Ti.*'),
ignore_index = False)
is the fastest of the lot as tested, coming in at 3.53 ms ± 95 µs per loop.
(It is worth noting that this dataset is probably too small to care about speed, and the order may not be the same on larger datasets)
CodePudding user response:
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.select_columns('TP*', 'VR*', 'Ti*')
.pivot_longer(index = None,
names_to = ('.value', 'ID'),
names_pattern = ('(. )(\d). '))
.rename(columns = {'TP':'Price', 'VReal':'Net', 'TiV':'Range'})
)
ID Price Net Range
0 1 0 300 30
1 2 700 1159 30
2 3 2100 2877 47
In the above solution, regex pattern is used to extract the relevant sub labels in the columns ; .value
determines which of the sub labels remain as headers.
Another solution, that might be useful is to pass a list of regular expressions to names_pattern
parameter:
df.pivot_longer(index = None,
names_to = ('Price', 'Net', 'Range'),
names_pattern = ('TP.*', 'VR.*', 'Ti.*'),
ignore_index = False)
Price Net Range
0 0 300 30
0 700 1159 30
0 2100 2877 47
CodePudding user response:
IIUC, you can use:
df = pd.DataFrame({'TP1':[0], 'TP2':[700], 'TP3':[2100], 'VR1':[300], 'VR2':[1159], 'VR3':[2877], 'TV1':[30], 'TV2':[30], 'TV3':[47]})
pd.wide_to_long(df.reset_index(), ["TP", "VR", "TV"], i="index", j="Nr").droplevel('index').rename(columns={'TP': 'Price', 'VR': 'Net', 'TV': 'Range'})
Result:
Price Net Range
Nr
1 0 300 30
2 700 1159 30
3 2100 2877 47