I want to reorganize the data in a Pandas DataFrame so that a subset of the values in one column ('City' in the minimal example below) become columns, and rows that have none of the chosen values (here 'New York' and 'Amsterdam') are removed.
To illustrate what I mean by the above, I have a DataFrame that looks like this:
# Id | Start_Time | End_Time | City | Price
--------------------------------------
1 | 2022-01-01 00:00:00.0000000 01:00 | 2022-01-01 01:00:00.0000000 01:00 | New York | 100
2 | 2022-01-01 01:00:00.0000000 01:00 | 2022-01-01 02:00:00.0000000 01:00 | New York | 90
3 | 2022-01-01 02:00:00.0000000 01:00 | 2022-01-01 03:00:00.0000000 01:00 | New York | 95
4 | 2022-01-01 01:00:00.0000000 01:00 | 2022-01-01 02:00:00.0000000 01:00 | Amsterdam | 300
5 | 2022-01-01 02:00:00.0000000 01:00 | 2022-01-01 03:00:00.0000000 01:00 | Amsterdam | 250
I would like to create a DataFrame that looks like this:
# Period | New York | Amsterdam | Difference
------------------------------------------------------------------------------------------
2022-01-01 00:00:00.0000000 01:00 - 2022-01-01 01:00:00.0000000 01:00 | 100 | NaN | NaN
2022-01-01 01:00:00.0000000 01:00 - 2022-01-01 02:00:00.0000000 01:00 | 90 | 300 | -210
2022-01-01 02:00:00.0000000 01:00 - 2022-01-01 03:00:00.0000000 01:00 | 95 | 250 | -155
Current code:
Here is my code so far:
import pandas as pd
data = [
[1, '2022-01-01 00:00:00.0000000 01:00', '2022-01-01 01:00:00.0000000 01:00', 'New York', 100],
[2, '2022-01-01 01:00:00.0000000 01:00', '2022-01-01 02:00:00.0000000 01:00', 'New York', 90],
[3, '2022-01-01 02:00:00.0000000 01:00', '2022-01-01 03:00:00.0000000 01:00', 'New York', 95],
[4, '2022-01-01 01:00:00.0000000 01:00', '2022-01-01 02:00:00.0000000 01:00', 'Amsterdam', 300],
[5, '2022-01-01 02:00:00.0000000 01:00', '2022-01-01 03:00:00.0000000 01:00', 'Amsterdam', 250]
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])
ny = df[df['City'] == 'New York']
amsterdam = df[df['City'] == 'Amsterdam']
# Here I naively try to create a new DataFrame with the prices in the two cities as columns.
# I figure I can then add a new column with the difference with df_new['Difference'] = df_new['New York'] - df_new['Amsterdam']
df_new = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])
How do I go from a DataFrame that looks like the first one to one that looks like the latter?
CodePudding user response:
df['Period'] = df['Start_Time'].astype(str) + " - " + df['End_Time'].astype(str)
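The question also asks to drop rows for cities other than the chosen ones. A minimal sketch of that step, assuming the column names from the question; the shortened times, the 'London' row and the name `wanted` are made up here for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Start_Time": ["00:00", "01:00", "01:00"],
        "End_Time": ["01:00", "02:00", "02:00"],
        "City": ["New York", "Amsterdam", "London"],  # 'London' is a hypothetical extra city
        "Price": [100, 300, 500],
    }
)

wanted = ["New York", "Amsterdam"]  # hypothetical name for the chosen cities

# Keep only rows whose City is one of the chosen values.
filtered = df[df["City"].isin(wanted)].copy()

# Build the period label on the filtered frame.
filtered["Period"] = filtered["Start_Time"] + " - " + filtered["End_Time"]
print(filtered)
```

Taking a `.copy()` before adding the column avoids assigning into a slice of the original frame.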
CodePudding user response:
Because the column order matters, we first need to turn City into an ordered categorical column.
Then we can create the pivot using pivot_table:
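# The pivot below expects a 'period' column; build it from the start and end times.
df['period'] = df['Start_Time'] + ' - ' + df['End_Time']
df['City'] = pd.Categorical(df['City'], ordered=True, categories=['New York', 'Amsterdam'])
df1 = df.pivot_table(index='period', columns='City', values='Price', aggfunc='first')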
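Finally, add your difference column. diff(axis=1) subtracts each column from the one to its right, so this gives Amsterdam minus New York; negate it if you want New York minus Amsterdam as in the question: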
df1['sub'] = df1.diff(axis=1).iloc[:,-1]
print(df1)
City New York Amsterdam sub
period
2022-01-01 00:00:00.0000000 01:00 - 2022-01-01... 100.0 NaN NaN
2022-01-01 01:00:00.0000000 01:00 - 2022-01-01... 90.0 300.0 210.0
2022-01-01 02:00:00.0000000 01:00 - 2022-01-01... 95.0 250.0 155.0
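If the difference should be New York minus Amsterdam, as in the table in the question, one alternative is to skip the categorical and subtract the two columns explicitly. This is only a sketch using the question's data and column names, not the method from the answer above; `cities` and `result` are names chosen here for illustration, and `DataFrame.pivot` assumes each period/city pair occurs at most once.

```python
import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 01:00', '2022-01-01 01:00:00.0000000 01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 01:00', '2022-01-01 02:00:00.0000000 01:00', 'New York', 90],
    [3, '2022-01-01 02:00:00.0000000 01:00', '2022-01-01 03:00:00.0000000 01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 01:00', '2022-01-01 02:00:00.0000000 01:00', 'Amsterdam', 300],
    [5, '2022-01-01 02:00:00.0000000 01:00', '2022-01-01 03:00:00.0000000 01:00', 'Amsterdam', 250],
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])

cities = ['New York', 'Amsterdam']  # chosen cities, in the desired column order

# One label per time slot, used as the row index of the result.
df['Period'] = df['Start_Time'] + ' - ' + df['End_Time']

# Spread the cities into columns, then keep only the chosen ones in the chosen order.
result = df.pivot(index='Period', columns='City', values='Price')[cities]

# Explicit subtraction makes the sign unambiguous.
result['Difference'] = result['New York'] - result['Amsterdam']
print(result)
```

Periods where one city has no price keep NaN in that column and in Difference.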