I have a dataframe as follows:
arrays = [np.array(["berlin", "berlin", "paris", "paris", "rome", "rome", "seville", "seville"]),
np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns = ['mike','ana','manu','analia'])
it has a multiindex in the rows. I would like to transform that DF into another one that would also have a multindex in the columns.
the function could be generalized as this:
def sortit(colname):
if colname.startswith('m'):
return 'm'
elif colname.startswith('m'):
return 'a'
The expected output is the following:
arrays = [np.array(["berlin", "berlin", "paris", "paris", "rome", "rome", "seville", "seville"]),
np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
tuples_i = list(zip(*arrays))
index_rows = pd.MultiIndex.from_tuples(tuples_i, names=["city", "number"])
arrays2 = [np.array(["m","m", "a","a"]),
np.array(['mike','manu','ana','analia'])]
tuples_c = list(zip(*arrays2))
print(tuples)
index_columns = pd.MultiIndex.from_tuples(tuples_c, names=["department", "name"])
df = pd.DataFrame(np.random.randn(8, 4), index=index_rows, columns = index_columns)
df
two important notes.
My starting point is the dataframe with multindex in the rows and no multiindex in the cols. I can not change that.
The method that illustrates here to which multiindex belong every name (column name) is a straight forward one, as example, the reality is that that function is way more complicated and time consuming, that's why I would like to create once the multilevel col index to make queries later much faster.
CodePudding user response:
You can craft a new MultiIndex with MultiIndex.from_arrays
:
idx = pd.MultiIndex.from_arrays([df.columns.str.extract('(.)', expand=False),
df.columns],
names=['department', 'name'])
df.columns = idx
print(df.sort_index(level=0, axis=1))
Output:
department a m
name ana analia manu mike
berlin one 0.465270 -0.549246 0.931020 0.027496
two -2.156006 -2.053703 0.162281 0.741966
paris one 0.084072 1.729949 1.366554 0.402933
two 1.157244 1.762093 -1.808943 -1.737110
rome one -0.009257 -0.457297 -0.479836 -2.483149
two -0.593379 -0.012763 -1.491018 -0.439712
seville one -1.118433 0.029189 -0.805858 -0.342481
two -0.389120 -0.390189 -1.260496 -0.010572
CodePudding user response:
Code
you can make multi index easily from making tuple
(df.set_axis(df.columns.map(lambda x: (x[0], x)), axis=1)
.rename_axis(['department', 'name'], axis=1))
output:
department m a m a
name mike ana manu analia
berlin one 0.6 -0.0 2.9 1.3
two 1.3 0.4 0.0 -3.0
paris one -0.5 -0.8 0.4 0.0
two -0.6 -1.0 0.5 0.3
rome one -1.5 0.2 -0.0 1.4
two -1.5 -1.9 0.0 -0.0
seville one -1.3 1.3 0.7 0.5
two -0.2 -0.2 -0.7 0.4