So imagine I have data like this:
name time
0 A 1
1 A 2
2 B 3
3 A 6
4 A 7
5 A 3
6 B 1
7 B 4
Each entry has a named category and some other information. In the example above lets take time. Its the only one i care about.
I would like to produce a table which has just the individual unique name
categopries and the max of each. I can do something like this:
max_table = pd.DataFrame(
{
"name": data.name.unique(),
"max_val": [
data[data["name"] == name].time.max() for name in data.name.unique()
],
}
)
But this does not feel very pandas-y. I have to go to and from the pandas table between lists and do some array expansion to make this work. Is there a way to do this with just a pandas type call?
Full example including data creation:
data = pd.DataFrame(
{
"name": pd.Categorical(["A", "A", "B", "A", "A", "A", "B", "B"]),
"time": [1, 2, 3, 6, 7, 3, 1, 4],
}
)
print(data)
print("======================")
max_table = pd.DataFrame(
{
"name": data.name.unique(),
"max_val": [
data[data["name"] == name].time.max() for name in data.name.unique()
],
}
)
print(max_table)
CodePudding user response:
The first thing to get to know about Pandas
or Numpy
is Vectorized manipulation(like Matlab dealing data with matrix/arrays). Which can process multiple items of data at once instead of for
loop operation.
groupby()
clustering data into groups, and max()
find the maximum value of each group.
output = df.groupby('name').max()
output
###
time
name
A 7
B 4
The remaining part is just to reconstruct the table(DataFrame) structure.
(As you can see, name
is lower than time
, which means name
is set to be index of the table(DataFrame), and time
is a column's name of the table)
output = output.reset_index().rename(columns={'time':'max_time'})
output
###
name max_time
0 A 7
1 B 4
CodePudding user response:
A bit cleaner:
out = df.groupby('name', as_index=False)['time'].agg({'max_time': 'max'})
print(out)
Output:
name max_time
0 A 7
1 B 4
CodePudding user response:
output = data.groupby('name')['time'].max()
Output:
Name
A 7
B 4