Home > Net >  Create new DataFrame based on existing DataFrame
Create new DataFrame based on existing DataFrame

Time:06-17

I want to reorganize some data in a Pandas DataFrame from an existing DataFrame such that I can have a subset of row values (in the minimal example below 'City' as columns, and remove rows not having either of the chosen values, below 'New York' and 'Amsterdam').

To illustrate what I mean by the above, I have a DataFrame that looks like this:

# Id |  Start_Time                       | End_Time                           | City       | Price
--------------------------------------
1 | 2022-01-01 00:00:00.0000000  01:00 | 2022-01-01 01:00:00.0000000  01:00 | New York   | 100
2 | 2022-01-01 01:00:00.0000000  01:00 | 2022-01-01 02:00:00.0000000  01:00 | New York   | 90
3 | 2022-01-01 02:00:00.0000000  01:00 | 2022-01-01 03:00:00.0000000  01:00 | New York   | 95
4 | 2022-01-01 01:00:00.0000000  01:00 | 2022-01-01 02:00:00.0000000  01:00 | Amsterdam  | 300
5 | 2022-01-01 02:00:00.0000000  01:00 | 2022-01-01 03:00:00.0000000  01:00 | Amsterdam  | 250

I would like to create a DataFrame that looks like this:

# Period                                                                | New York | Amsterdam | Difference
------------------------------------------------------------------------------------------
2022-01-01 00:00:00.0000000  01:00 - 2022-01-01 01:00:00.0000000  01:00 | 100      | NaN |  NaN          
2022-01-01 01:00:00.0000000  01:00 - 2022-01-01 02:00:00.0000000  01:00 | 90       | 300 | -210
2022-01-01 02:00:00.0000000  01:00 - 2022-01-01 03:00:00.0000000  01:00 | 95       | 250 | -155

Current code:

Here is my code so far:

import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000  01:00', '2022-01-01 01:00:00.0000000  01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000  01:00', '2022-01-01 02:00:00.0000000  01:00', 'New York', 90], 
    [3, '2022-01-01 02:00:00.0000000  01:00', '2022-01-01 03:00:00.0000000  01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000  01:00', '2022-01-01 02:00:00.0000000  01:00', 'Amsterdam', 300], 
    [5, '2022-01-01 02:00:00.0000000  01:00', '2022-01-01 03:00:00.0000000  01:00', 'Amsterdam', 250]
]

df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])

ny = df[df['City'] == 'New York']
amsterdam = df[df['Auction'] == 'Amsterdam']

# Here I naively try to create a new DataFrame with the price in the two cities as colums.
# I figure I can then add a new column with the difference with df_new['Difference'] = df_new['New York'] - df_new['Amsterdam']
df_new = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])

How do I go from a DataFrame that looks like the first one to one that looks like the latter?

CodePudding user response:

df['Period] = df['Start_Time'].astype(str)   " "   df['End_time].astype(str)

CodePudding user response:

As order is important we first need to create a categorical column based on City.

Then we can create a pivot using pd.pivot_table

df['City'] = pd.Categorical(df['City'], ordered=True, categories=['New York', 'Amsterdam'])


df1 = df.pivot_table(index='period', columns='City', values='Price',aggfunc='first')

Finally, your difference column

df1['sub'] = df1.diff(axis=1).iloc[:,-1]


print(df1)

City                                                New York  Amsterdam    sub
period
2022-01-01 00:00:00.0000000  01:00 - 2022-01-01...     100.0        NaN    NaN
2022-01-01 01:00:00.0000000  01:00 - 2022-01-01...      90.0      300.0  210.0
2022-01-01 02:00:00.0000000  01:00 - 2022-01-01...      95.0      250.0  155.0
  • Related