Home > Back-end >  How to split a Multiindex row into two Multiindex rows?
How to split a Multiindex row into two Multiindex rows?

Time:11-23

I have a dataframe with multiple levels of Multiindex. One of the levels is latlon, which is a string of the numbers with an ; between them. However, for further processing, it makes much more sense to have a lat level and a lon level. with floats for the numbers, instead of the combined string.

How do I best partition this level into two levels?

I have a solution, but it doesn't seem very pythonic and requires building a new dataframe, so I'm looking for a better way.

MWE:

Set up a simple test df:

number = [1, 2, 3]
name = ['foo', 'bar', 'baz']
latlon = ['10.1;50.1', '12.2;52.2', '13.3;53.3']
idx = pd.MultiIndex.from_arrays([number, name, latlon], 
                                 names=('number','name', 'latlon'))
data = np.random.rand(4,3)
df = pd.DataFrame(data=data, columns=idx)

(Original data has 10 levels in the Multiindex and is of size 25000, 750) As you can see, latlon is easily human-readble, but not particularly useful. I want a lat and lon level, with floats.

What I've come up with:

# get a list of them, to iterate through
latlons = df.columns.get_level_values('latlon').to_list()
# set up emptly lists and start iterating
lats = []
lons = []
for i in latlons:
# do some string searches and split by positions
    start_str = i.find(';') 1 
    end_str = i.find('\n')
    lon_str = i[0:start_str-1]
    lon = float(lon_str)
    lons.append(lon)
    lat_str = i[start_str:end_str]
    lat = float(lat_str)
    lats.append(lat)

Now there's two lists, one with lats and one with lons, which can be used to build a new index and thus a new df:

number = df.columns.get_level_values('number').to_list()
# I can't reuse 'number' from the initial setup, since the original
# comes from an excel import, so I must extract it here.
name = df.columns.get_level_values('name').to_list()
idx = pd.MultiIndex.from_arrays([number, name, lats, lons], 
                                 names=('number','name', 'lat', 'lon'))
data = df.values
df2 = pd.DataFrame(data=data, columns=idx)

This works and is very easy to understand, but it all feels very hacky and one hickup away from mixing up data.

Is there a simpler/better way?

CodePudding user response:

I would temporarily convert the MultiIndex to DataFrame to benefit from DataFrame's methods:

new_idx = pd.MultiIndex.from_frame(
 df.columns.to_frame()
   .pipe(lambda d: d.join(d.pop('latlon')
                           .str.split(';', expand=True)
                           .set_axis(['lat', 'lon'], axis=1)
                         ))
)

df.columns = new_idx

Output:

number         1         2         3
name         foo       bar       baz
lat         10.1      12.2      13.3
lon         50.1      52.2      53.3
0       0.796467  0.769194  0.733470
1       0.272247  0.558985  0.345007
2       0.209480  0.669443  0.648002
3       0.466146  0.262006  0.236987

CodePudding user response:

extract the index, split, and rebuild the index:

from itertools import chain

latlon = df.columns.get_level_values('latlon')
df.columns = df.columns.droplevel('latlon')
latlon = [map(float, ent) for ent in latlon.str.split(';')]
latlon = [tuple(chain.from_iterable(ent)) for ent in zip(df.columns, latlon)]
names = list(df.columns.names)
names.extend(['lat','lon'])
df.columns = pd.MultiIndex.from_tuples(latlon, names = names)
df
number         1         2         3
name         foo       bar       baz
lat         10.1      12.2      13.3
lon         50.1      52.2      53.3
0       0.509162  0.645502  0.873028
1       0.761990  0.557842  0.787984
2       0.902023  0.776042  0.823232
3       0.314977  0.665694  0.072290
  • Related