How to use Pandas dataframe transform 2D dataset into 3D?-CodePudding

I have a tropical cyclone dataset like this: TC dataset

I want to group the dataset according to the column 'TC_name', i.e. one TC one group, but 'TC_name' may be identical for different years and the same TC may travel interannually, so groupby() maybe doesn't work.

I also tried to loop line by line and find where the 'TC_name' changes, cut the dataset there to get a dataframe for each TC. However, this gives me over 1000 dataframes and I don't know how to represent this 2D information in a 3D formation.

I guess multi index might work through assigning a high level index to each TC, but I cannot reach it after trying many times.

An example is provided below for clarification: (I need seperate 3 tropical cyclones)

import pandas as pd
data = {'TC_name':['RITA','RITA','IDA','IDA','IDA','IDA','RITA','RITA','RITA'],
        'Year':[2020,2020,2020,2020,2020,2021,2021,2021,2021]}
df = pd.DataFrame(data)
df

    TC_name Year
0   RITA    2020
1   RITA    2020
2   IDA     2020
3   IDA     2020
4   IDA     2020
5   IDA     2021
6   RITA    2021
7   RITA    2021
8   RITA    2021

And it would be better to get this format, or any other efficient 3D data structure representation.

         TC_name    Year
   1   1    RITA    2020
       2    RITA    2020
   2   1    IDA     2020
       2    IDA     2020
       3    IDA     2020
       4    IDA     2021
   3   1    RITA    2021
       2    RITA    2021
       3    RITA    2021

Thanks a lot for your help!

CodePudding user response：

Edit

Here another possibility. We reset the index to keep the original sorting:

df = df.reset_index()

Then we assign a new id within each TC_name group:

df['id2'] = df.groupby('TC_name')['index'].transform(lambda x: pd.factorize(x)[0]   1)

Here the intermediate output:

	index	TC_name	Year	id2
0	0	RITA	2020	1
1	1	RITA	2020	2
2	2	IDA	2020	1
3	3	IDA	2020	2
4	4	IDA	2020	3
5	5	IDA	2021	4

Then we assign a new id every time the TC_name changes:

df['id'] = (df["TC_name"] != df["TC_name"].shift(1)).cumsum()

And finally we drop the index and set the multiindex:

df = df.drop('index', axis=1).set_index(['id', 'id2'])

Here the final output, which looks like yours:

		TC_name	Year
1	1	RITA	2020
	2	RITA	2020
2	1	IDA	2020
	2	IDA	2020
	3	IDA	2020
	4	IDA	2021
3	3	RITA	2021
	4	RITA	2021
	5	RITA	2021