I have following dataframe of 3 persons posted at 3 places on 3 days of the week:
I want to convert above data to following table format:
Effectively, I am interchanging cell values with column names.
I want some pointer on how to do this programmatically using Python.
CodePudding user response:
Hear is something that could work
- Get new column names ie places by finding unique values in current
dataframe
- loop through new column names(places), add names of old column(people) as new entry for each new column in the new dataframe using
apply
method
Code:
import io
import pandas as pd
import numpy as np
txt="""
tom,dick
Mon,cafe,lib
Tue,lib,gym
Wed,gym,gym
"""
df = pd.read_csv(io.StringIO(txt))
print(df)
new_cols = np.unique(df.values)
new_df = pd.DataFrame(columns=new_cols)
for col in new_cols:
new_df[col] = df.apply(lambda x: df.columns[x == col].values, axis=1)
print(new_df)
Output:
old df
tom dick
Mon cafe lib
Tue lib gym
Wed gym gym
new df
cafe gym lib
Mon [tom] [] [dick]
Tue [] [dick] [tom]
Wed [] [tom, dick] []
or if multiple people are never at the sample place ie you don't want a list of values:
import io
import pandas as pd
import numpy as np
txt="""
tom,dick
Mon,cafe,lib
Tue,lib,gym
Wed,gym,cafe
"""
df = pd.read_csv(io.StringIO(txt))
print(df)
new_cols = np.unique(df.values)
new_df = pd.DataFrame(columns=new_cols)
def get_place(row, col):
place = df.columns[row == col].values
return place[0] if len(place) else ''
for col in new_cols:
new_df[col] = df.apply(lambda x:get_place(x, col), axis=1)
print(new_df)
Output:
old df
tom dick
Mon cafe lib
Tue lib gym
Wed gym cafe
new df
cafe gym lib
Mon tom dick
Tue dick tom
Wed dick tom
CodePudding user response:
Given this dataframe df
:
Tom Dick Harry
Mon Cafe Library Gym
Tue Library Gym Cafe
Wed Gym Cafe Library
You can get the outcome you want by
(i) unstack
the dataframe to get a MultiIndex Series
(ii) rearrange the order of index levels and values of this Series
(iii) unstack the Series back
s = df.unstack()
s = pd.Series(s.index.get_level_values(0).to_numpy(), index = pd.MultiIndex.from_tuples(zip(s.index.get_level_values(1), s)))
out = s.unstack()
Output:
Cafe Gym Library
Mon Tom Harry Dick
Tue Harry Dick Tom
Wed Dick Tom Harry
This example is a special case where in step (iii) works out fine because there is no duplicate index, however, in general case, you'll more likely have duplicate index.
For example, you could have the following dataframe:
Tom Dick Harry
Mon Cafe Library Gym
Tue Library Gym Cafe
Wed Library Cafe Library
where on Wednesday
both Tom
and Harry
go to the library
. In that case the above solution wouldn't work because we'll have duplicate indices after step (ii). But you can solve that issue by adding two more lines after step (ii) by aggregating over index:
s = df.unstack()
s = pd.Series(s.index.get_level_values(0).to_numpy(), index = pd.MultiIndex.from_tuples(zip(s.index.get_level_values(1), s)))
s = s.groupby(s.index).agg(list)
s.index = pd.MultiIndex.from_tuples(s.index)
out = s.unstack()
Output:
Cafe Gym Library
Mon [Tom] [Harry] [Dick]
Tue [Harry] [Dick] [Tom]
Wed [Dick] NaN [Tom, Harry]