How to get the max of all the named categories in a pandas table-CodePudding

So imagine I have data like this:

  name  time
0    A     1
1    A     2
2    B     3
3    A     6
4    A     7
5    A     3
6    B     1
7    B     4

Each entry has a named category and some other information. In the example above lets take time. Its the only one i care about.

I would like to produce a table which has just the individual unique name categopries and the max of each. I can do something like this:

max_table = pd.DataFrame(
    {
        "name": data.name.unique(),
        "max_val": [
            data[data["name"] == name].time.max() for name in data.name.unique()
        ],
    }
)

But this does not feel very pandas-y. I have to go to and from the pandas table between lists and do some array expansion to make this work. Is there a way to do this with just a pandas type call?

Full example including data creation:

    data = pd.DataFrame(
        {
            "name": pd.Categorical(["A", "A", "B", "A", "A", "A", "B", "B"]),
            "time": [1, 2, 3, 6, 7, 3, 1, 4],
        }
    )
    print(data)
    print("======================")

    max_table = pd.DataFrame(
        {
            "name": data.name.unique(),
            "max_val": [
                data[data["name"] == name].time.max() for name in data.name.unique()
            ],
        }
    )
    print(max_table)

CodePudding user response：

The first thing to get to know about Pandas or Numpy is Vectorized manipulation(like Matlab dealing data with matrix/arrays). Which can process multiple items of data at once instead of for loop operation.

groupby() clustering data into groups, and max() find the maximum value of each group.

output = df.groupby('name').max()
output
###
      time
name      
A        7
B        4

The remaining part is just to reconstruct the table(DataFrame) structure.

(As you can see, name is lower than time, which means name is set to be index of the table(DataFrame), and time is a column's name of the table)

output = output.reset_index().rename(columns={'time':'max_time'})
output
###
  name  max_time
0    A         7
1    B         4

CodePudding user response：

A bit cleaner:

out = df.groupby('name', as_index=False)['time'].agg({'max_time': 'max'})
print(out)

Output:

  name  max_time
0    A         7
1    B         4

CodePudding user response：

output = data.groupby('name')['time'].max()

Output:

Name
A  7
B  4