Count combination of index pairs in a pivot table-CodePudding

I have a following question, hoping you can support me. Suppose you have following frame

df

  Name  Number  File_name  Frequency
0    A  item 1     path 1          2
1    A  item 1     path 2          2
2    A  item 2     path 1          4
3    A  item 2     path 2          4
4    A  item 3     path 1          1
5    A  item 2     path 3          4
6    A  item 2     path 4          4

my goal is to create a pivot table with following shape

df= Name      Number    File_name      Frequency   
    A         item 1     path 1            2 
                         path 2
              item 2     path 1            4 
                         path 2
                         path 3
                         path 4
              item 3     path 1            1

I tried the following way, since I do not know how to count the combination of an index pair for a multi-dimensional pivot table

df=pd.read_excel(path)
df["Unique ID"]=df["Name"] " " df["Number"] #Create an additional column to calculate frequency
df['frequency'] = df['Unique ID'].map(df['Unique ID'].value_counts())

pivot=pd.pivot_table(df,index=["Name",
                                "Number","File_name"],
                     values="frequency",fill_value=0)

print(pivot)

pivot= Name      Number    File_name      Frequency   
    A         item 1     path 1            2 
                         path 2            2
              item 2     path 1            4 
                         path 2            4
                         path 3            4
                         path 4            4
              item 3     path 1            1

My question is: How can I, should I change my code in order to get the desired outcome, i.e. not to have the frequency values as duplicates, but rather also as an index with unique values

CodePudding user response：

i start with this frame

In [485]: df
Out[485]:
  Name Number Filename
0    A  item1    path1
1    A  item1    path2
2    A  item2    path1
3    A  item2    path2
4    A  item3    path1
5    A  item2    path3
6    A  item2    path4

The desired frequency is groupby & transform("count"), similar to what you did

In [491]: g = df.groupby(["Name", "Number"])["Filename"]

In [492]: g.transform("count")
Out[492]:
0    2
1    2
2    4
3    4
4    1
5    4
6    4
Name: Filename, dtype: int64

Cool but need to mask the dupes here. And not globally but locally and only retain the first. So here's a trick: if i do g.cumcount(), it will give 0, 1, ... per group. Then I use 0's specialness to get the mask

In [493]: g.cumcount()
Out[493]:
0    0
1    1
2    0
3    1
4    0
5    2
6    3
dtype: int64

In [494]: ~g.cumcount().astype(bool)
Out[494]:
0     True
1    False
2     True
3    False
4     True
5    False
6    False
dtype: bool

Then i can multiply the counts with this, replace 0s (arising due to False's in multiplication) with empty string, set a multiindex & sort it and done:

In [496]: counts = g.transform("count")

In [497]: first_per_gr_mask = ~g.cumcount().astype(bool)

In [498]: freq = counts.mul(first_per_gr_mask).replace(0, "")

In [499]: freq
Out[499]:
0    2
1
2    4
3
4    1
5
6
dtype: object

In [500]: (df.assign(Frequency=freq)
     ...:    .set_index(["Name", "Number", "Filename"])
     ...:    .sort_index())
Out[500]:
                     Frequency
Name Number Filename
A    item1  path1            2
            path2
     item2  path1            4
            path2
            path3
            path4
     item3  path1            1

CodePudding user response：

What you want doesn't make a lot of sense, if you want something to act like an index... make it an index, otherwise you're just manually creating a non-functional, visually-pleasing, output. (Which Mustafa does a good job of executing.)

df['blank'] = np.nan
df = df.set_index(['Name', 'Number', 'Frequency', 'File_name']).sort_index()
print(df)

# Output:
                                 blank
Name Number Frequency File_name
A    item 1 2         path 1       NaN
                      path 2       NaN
     item 2 4         path 1       NaN
                      path 2       NaN
                      path 3       NaN
                      path 4       NaN
     item 3 1         path 1       NaN

CodePudding user response：

You could use:

df = df.sort_values(df.columns.drop('Frequency').tolist(), ignore_index=True)
df2 = (df == df.shift()).replace(True, '')
df2 = df2.mask(df2 != '').fillna(df)

print(df2):

  Name  Number File_name Frequency
0    A  item 1    path 1         2
1                 path 2          
2       item 2    path 1         4
3                 path 2          
4                 path 3          
5                 path 4          
6       item 3    path 1         1

Explanation:

a) df2 = (df == df.shift()) which gives:

    Name  Number  File_name  Frequency
0  False   False      False      False
1   True    True      False       True
2   True   False      False      False
3   True    True      False       True
4   True    True      False       True
5   True    True      False       True
6   True   False      False      False

b) Replace True values with empty string ''

df2 = df2.replace(True, '') which gives:

    Name Number  File_name Frequency
0  False  False      False     False
1                    False          
2         False      False     False
3                    False          
4                    False          
5                    False          
6         False      False     False

c) Mask non-empty values with NaN

df2.mask(df2 != '')

  Name Number  File_name Frequency
0  NaN    NaN        NaN       NaN
1                    NaN          
2         NaN        NaN       NaN
3                    NaN          
4                    NaN          
5                    NaN          
6         NaN        NaN       NaN

d) fillna of this df with the original one

df2.fillna(df)

  Name  Number File_name Frequency
0    A  item 1    path 1         2
1                 path 2          
2       item 2    path 1         4
3                 path 2          
4                 path 3          
5                 path 4          
6       item 3    path 1         1