I have a dataframe with multiple levels of Multiindex. One of the levels is latlon
, which is a string of the numbers with an ;
between them.
However, for further processing, it makes much more sense to have a lat
level and a lon
level. with floats for the numbers, instead of the combined string.
How do I best partition this level into two levels?
I have a solution, but it doesn't seem very pythonic and requires building a new dataframe, so I'm looking for a better way.
MWE:
Set up a simple test df:
number = [1, 2, 3]
name = ['foo', 'bar', 'baz']
latlon = ['10.1;50.1', '12.2;52.2', '13.3;53.3']
idx = pd.MultiIndex.from_arrays([number, name, latlon],
names=('number','name', 'latlon'))
data = np.random.rand(4,3)
df = pd.DataFrame(data=data, columns=idx)
(Original data has 10 levels in the Multiindex and is of size 25000, 750
)
As you can see, latlon
is easily human-readble, but not particularly useful. I want a lat
and lon
level, with floats.
What I've come up with:
# get a list of them, to iterate through
latlons = df.columns.get_level_values('latlon').to_list()
# set up emptly lists and start iterating
lats = []
lons = []
for i in latlons:
# do some string searches and split by positions
start_str = i.find(';') 1
end_str = i.find('\n')
lon_str = i[0:start_str-1]
lon = float(lon_str)
lons.append(lon)
lat_str = i[start_str:end_str]
lat = float(lat_str)
lats.append(lat)
Now there's two lists, one with lats and one with lons, which can be used to build a new index and thus a new df:
number = df.columns.get_level_values('number').to_list()
# I can't reuse 'number' from the initial setup, since the original
# comes from an excel import, so I must extract it here.
name = df.columns.get_level_values('name').to_list()
idx = pd.MultiIndex.from_arrays([number, name, lats, lons],
names=('number','name', 'lat', 'lon'))
data = df.values
df2 = pd.DataFrame(data=data, columns=idx)
This works and is very easy to understand, but it all feels very hacky and one hickup away from mixing up data.
Is there a simpler/better way?
CodePudding user response:
I would temporarily convert the MultiIndex to DataFrame to benefit from DataFrame
's methods:
new_idx = pd.MultiIndex.from_frame(
df.columns.to_frame()
.pipe(lambda d: d.join(d.pop('latlon')
.str.split(';', expand=True)
.set_axis(['lat', 'lon'], axis=1)
))
)
df.columns = new_idx
Output:
number 1 2 3
name foo bar baz
lat 10.1 12.2 13.3
lon 50.1 52.2 53.3
0 0.796467 0.769194 0.733470
1 0.272247 0.558985 0.345007
2 0.209480 0.669443 0.648002
3 0.466146 0.262006 0.236987
CodePudding user response:
extract the index, split, and rebuild the index:
from itertools import chain
latlon = df.columns.get_level_values('latlon')
df.columns = df.columns.droplevel('latlon')
latlon = [map(float, ent) for ent in latlon.str.split(';')]
latlon = [tuple(chain.from_iterable(ent)) for ent in zip(df.columns, latlon)]
names = list(df.columns.names)
names.extend(['lat','lon'])
df.columns = pd.MultiIndex.from_tuples(latlon, names = names)
df
number 1 2 3
name foo bar baz
lat 10.1 12.2 13.3
lon 50.1 52.2 53.3
0 0.509162 0.645502 0.873028
1 0.761990 0.557842 0.787984
2 0.902023 0.776042 0.823232
3 0.314977 0.665694 0.072290