Home > Back-end >  How to use Pandas dataframe transform 2D dataset into 3D?
How to use Pandas dataframe transform 2D dataset into 3D?

Time:09-23

I have a tropical cyclone dataset like this: TC dataset

I want to group the dataset according to the column 'TC_name', i.e. one TC one group, but 'TC_name' may be identical for different years and the same TC may travel interannually, so groupby() maybe doesn't work.

I also tried to loop line by line and find where the 'TC_name' changes, cut the dataset there to get a dataframe for each TC. However, this gives me over 1000 dataframes and I don't know how to represent this 2D information in a 3D formation.

I guess multi index might work through assigning a high level index to each TC, but I cannot reach it after trying many times.

An example is provided below for clarification: (I need seperate 3 tropical cyclones)

import pandas as pd
data = {'TC_name':['RITA','RITA','IDA','IDA','IDA','IDA','RITA','RITA','RITA'],
        'Year':[2020,2020,2020,2020,2020,2021,2021,2021,2021]}
df = pd.DataFrame(data)
df
    TC_name Year
0   RITA    2020
1   RITA    2020
2   IDA     2020
3   IDA     2020
4   IDA     2020
5   IDA     2021
6   RITA    2021
7   RITA    2021
8   RITA    2021

And it would be better to get this format, or any other efficient 3D data structure representation.

         TC_name    Year
   1   1    RITA    2020
       2    RITA    2020
   2   1    IDA     2020
       2    IDA     2020
       3    IDA     2020
       4    IDA     2021
   3   1    RITA    2021
       2    RITA    2021
       3    RITA    2021

Thanks a lot for your help!

CodePudding user response:

Edit

Here another possibility. We reset the index to keep the original sorting:

df = df.reset_index()

Then we assign a new id within each TC_name group:

df['id2'] = df.groupby('TC_name')['index'].transform(lambda x: pd.factorize(x)[0]   1)

Here the intermediate output:

index TC_name Year id2
0 0 RITA 2020 1
1 1 RITA 2020 2
2 2 IDA 2020 1
3 3 IDA 2020 2
4 4 IDA 2020 3
5 5 IDA 2021 4

Then we assign a new id every time the TC_name changes:

df['id'] = (df["TC_name"] != df["TC_name"].shift(1)).cumsum()

And finally we drop the index and set the multiindex:

df = df.drop('index', axis=1).set_index(['id', 'id2'])

Here the final output, which looks like yours:

TC_name Year
1 1 RITA 2020
2 RITA 2020
2 1 IDA 2020
2 IDA 2020
3 IDA 2020
4 IDA 2021
3 3 RITA 2021
4 RITA 2021
5 RITA 2021
  • Related