I have the following question and hope you can help. Suppose you have the following frame:
df
Name Number File_name Frequency
0 A item 1 path 1 2
1 A item 1 path 2 2
2 A item 2 path 1 4
3 A item 2 path 2 4
4 A item 3 path 1 1
5 A item 2 path 3 4
6 A item 2 path 4 4
My goal is to create a pivot table with the following shape:
df= Name Number File_name Frequency
A item 1 path 1 2
path 2
item 2 path 1 4
path 2
path 3
path 4
item 3 path 1 1
I tried the following, since I do not know how to count the occurrences of an index pair in a multi-level pivot table:
import pandas as pd

df = pd.read_excel(path)
# Create an additional column to calculate the frequency
df["Unique ID"] = df["Name"] + " " + df["Number"]
df["frequency"] = df["Unique ID"].map(df["Unique ID"].value_counts())
pivot = pd.pivot_table(df, index=["Name", "Number", "File_name"],
                       values="frequency", fill_value=0)
print(pivot)
pivot= Name Number File_name Frequency
A item 1 path 1 2
path 2 2
item 2 path 1 4
path 2 4
path 3 4
path 4 4
item 3 path 1 1
My question is: how should I change my code to get the desired outcome, i.e. so that the frequency values are not repeated on every row but shown once per group, like an index level with unique values?
CodePudding user response:
I start with this frame:
In [485]: df
Out[485]:
Name Number Filename
0 A item1 path1
1 A item1 path2
2 A item2 path1
3 A item2 path2
4 A item3 path1
5 A item2 path3
6 A item2 path4
The desired frequency is a groupby followed by transform("count"), similar to what you did:
In [491]: g = df.groupby(["Name", "Number"])["Filename"]
In [492]: g.transform("count")
Out[492]:
0 2
1 2
2 4
3 4
4 1
5 4
6 4
Name: Filename, dtype: int64
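For anyone who wants to run this step on its own, here is a minimal sketch. The frame is rebuilt by hand from the output above, and the names counts, key and counts_via_map are only illustrative. It also checks that the groupby version gives the same numbers as the value_counts mapping from the question.

```python
import pandas as pd

# Sample frame rebuilt by hand from the answer above
df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item1", "item1", "item2", "item2", "item3", "item2", "item2"],
    "Filename": ["path1", "path2", "path1", "path2", "path1", "path3", "path4"],
})

# Every row receives the number of rows sharing its (Name, Number) pair
counts = df.groupby(["Name", "Number"])["Filename"].transform("count")

# The question's approach: build a combined key, then map its value_counts back
key = df["Name"] + " " + df["Number"]
counts_via_map = key.map(key.value_counts())

print(counts.tolist())
print(counts_via_map.tolist())
```

The groupby form avoids the helper "Unique ID" column, which is the main reason to prefer it here.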
Good, but the duplicates still need to be masked, and not globally but per group, keeping only the first occurrence. Here's a trick: g.cumcount() numbers the rows 0, 1, ... within each group. Since 0 is the only value that casts to False, converting to bool and negating gives the mask:
In [493]: g.cumcount()
Out[493]:
0 0
1 1
2 0
3 1
4 0
5 2
6 3
dtype: int64
In [494]: ~g.cumcount().astype(bool)
Out[494]:
0 True
1 False
2 True
3 False
4 True
5 False
6 False
dtype: bool
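As a side note, the same "first row of each group" mask can probably be written in a couple of other ways. The sketch below rebuilds the frame by hand and compares three spellings; mask_eq and mask_dup are my own names, not part of the answer.

```python
import pandas as pd

# Sample frame rebuilt by hand from the answer above
df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item1", "item1", "item2", "item2", "item3", "item2", "item2"],
    "Filename": ["path1", "path2", "path1", "path2", "path1", "path3", "path4"],
})
g = df.groupby(["Name", "Number"])["Filename"]

# The answer's spelling: only a cumcount of 0 casts to False, so negate the cast
mask_bool = ~g.cumcount().astype(bool)

# Equivalent and arguably more explicit: compare the running count with 0
mask_eq = g.cumcount().eq(0)

# Alternative without groupby: mark every repeat of a (Name, Number) pair
mask_dup = ~df.duplicated(subset=["Name", "Number"])

print(mask_bool.tolist())
```

All three select rows 0, 2 and 4 on this data, so which one to use is mostly a matter of readability.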
Then I can multiply the counts by this mask, replace the 0s (produced by multiplying with False) with an empty string, set a MultiIndex, sort it, and that's it:
In [496]: counts = g.transform("count")
In [497]: first_per_gr_mask = ~g.cumcount().astype(bool)
In [498]: freq = counts.mul(first_per_gr_mask).replace(0, "")
In [499]: freq
Out[499]:
0 2
1
2 4
3
4 1
5
6
dtype: object
In [500]: (df.assign(Frequency=freq)
...: .set_index(["Name", "Number", "Filename"])
...: .sort_index())
Out[500]:
Frequency
Name Number Filename
A item1 path1 2
path2
item2 path1 4
path2
path3
path4
item3 path1 1
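Putting the steps together, a self-contained variant might look like the sketch below. Two things differ from the session above and are my own choices: the frame is sorted before the mask is computed, so the count is guaranteed to land on the first row of each group after sorting, and Series.where is used instead of multiplying and replacing 0.

```python
import pandas as pd

# Sample frame rebuilt by hand from the answer above
df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item1", "item1", "item2", "item2", "item3", "item2", "item2"],
    "Filename": ["path1", "path2", "path1", "path2", "path1", "path3", "path4"],
})

# Sort first so "first row per group" matches the final display order
df = df.sort_values(["Name", "Number", "Filename"], ignore_index=True)

g = df.groupby(["Name", "Number"])["Filename"]
counts = g.transform("count")
first_in_group = g.cumcount().eq(0)

# Keep the count on the first row of each group, blank it elsewhere.
# Casting to object first lets ints and "" share one column.
freq = counts.astype(object).where(first_in_group, "")

result = (df.assign(Frequency=freq)
            .set_index(["Name", "Number", "Filename"])
            .sort_index())
print(result)
```

Keep in mind that the Frequency column ends up with object dtype, so it is meant for display rather than further arithmetic.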
CodePudding user response:
What you want doesn't make a lot of sense. If you want something to act like an index, make it an index; otherwise you're just manually creating non-functional, visually pleasing output (which Mustafa does a good job of executing).
import numpy as np
df['blank'] = np.nan  # placeholder column so the frame has something to display
df = df.set_index(['Name', 'Number', 'Frequency', 'File_name']).sort_index()
print(df)
# Output:
blank
Name Number Frequency File_name
A item 1 2 path 1 NaN
path 2 NaN
item 2 4 path 1 NaN
path 2 NaN
path 3 NaN
path 4 NaN
item 3 1 path 1 NaN
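To illustrate why a real index level can be more useful than blanked cells, here is a rough sketch. It rebuilds the question's frame by hand, computes Frequency with groupby/transform (an assumption on my part, since the snippet above takes the column as given), and then selects on the Frequency level. The names indexed, is_freq_4 and files_with_freq_4 are hypothetical.

```python
import pandas as pd

# Sample frame rebuilt by hand from the question
df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item 1", "item 1", "item 2", "item 2", "item 3", "item 2", "item 2"],
    "File_name": ["path 1", "path 2", "path 1", "path 2", "path 1", "path 3", "path 4"],
})
df["Frequency"] = df.groupby(["Name", "Number"])["File_name"].transform("count")

# All four columns become index levels; no placeholder column is kept here
indexed = df.set_index(["Name", "Number", "Frequency", "File_name"]).sort_index()

# Because Frequency is a real level, it can be used for selection
is_freq_4 = indexed.index.get_level_values("Frequency") == 4
files_with_freq_4 = indexed[is_freq_4].index.get_level_values("File_name").tolist()
print(files_with_freq_4)
```

With the blanked-cell versions this kind of lookup would not work, because most rows no longer carry their frequency.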
CodePudding user response:
You could use:
df = df.sort_values(df.columns.drop('Frequency').tolist(), ignore_index=True)
df2 = (df == df.shift()).replace(True, '')
df2 = df2.mask(df2 != '').fillna(df)
print(df2)
Name Number File_name Frequency
0 A item 1 path 1 2
1 path 2
2 item 2 path 1 4
3 path 2
4 path 3
5 path 4
6 item 3 path 1 1
Explanation:
a) df2 = (df == df.shift())
which gives:
Name Number File_name Frequency
0 False False False False
1 True True False True
2 True False False False
3 True True False True
4 True True False True
5 True True False True
6 True False False False
b) Replace True values with empty string ''
df2 = df2.replace(True, '')
which gives:
Name Number File_name Frequency
0 False False False False
1 False
2 False False False
3 False
4 False
5 False
6 False False False
c) Mask non-empty values with NaN
df2.mask(df2 != '')
Name Number File_name Frequency
0 NaN NaN NaN NaN
1 NaN
2 NaN NaN NaN
3 NaN
4 NaN
5 NaN
6 NaN NaN NaN
d) Fill the remaining NaN values from the original frame
df2.fillna(df)
Name Number File_name Frequency
0 A item 1 path 1 2
1 path 2
2 item 2 path 1 4
3 path 2
4 path 3
5 path 4
6 item 3 path 1 1
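The four steps can probably be condensed into a single mask call. The sketch below rebuilds the question's frame by hand and casts to object first so strings, ints and '' can share a column; obj and blanked are illustrative names. One caveat that applies to this whole approach: each cell is compared only with the cell directly above it, so two neighbouring groups that happen to share the same Frequency (or File_name) would have the second value blanked as well.

```python
import pandas as pd

# Sample frame rebuilt by hand from the question
df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item 1", "item 1", "item 2", "item 2", "item 3", "item 2", "item 2"],
    "File_name": ["path 1", "path 2", "path 1", "path 2", "path 1", "path 3", "path 4"],
    "Frequency": [2, 2, 4, 4, 1, 4, 4],
})

# Sort by the key columns so equal values end up on adjacent rows
df = df.sort_values(["Name", "Number", "File_name"], ignore_index=True)

# Object dtype lets '' replace values in every column
obj = df.astype(object)

# Blank every cell that equals the cell directly above it
blanked = obj.mask(obj == obj.shift(), "")
print(blanked)
```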