I have a dataframe that looks like this:
Item Year
I1 2015
I2 2016
I1 2017
I2 2014
For example, Item I2 was sold in 2016 and 2014.
I want to group by Item and Year and then do what this R code does:
top_items <- data %>% select(Item, Year) %>%
group_by(Year, Item) %>%
summarize(sales_trend = n()) %>%
arrange(desc(sales_trend))
That is, I need the most-bought items as output, sorted by count in descending order.
I am trying this Python code:
b_data = pd.DataFrame(data[["Item", "Year"]].groupby(["Item", "Year"]).size()).sort_values(by=[0], ascending=False)
But I get an additional column named 0. I want to sort by it, but I don't want the column to be called 0. How can I have it called sales_trend, like in the R code?
Also, how do I write the Python equivalent of the following R code, which continues the previous pipeline?
...
arrange(desc(sales_trend)) %>%
slice_head(n = 5) %>%
mutate(Year = as.integer(Year), rank = 1:5) %>%
select(-sales_trend)
CodePudding user response:
df.groupby("Year")["Item"].value_counts().sort_values(ascending=False)
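To get the count column named sales_trend rather than 0, one option is to call reset_index(name=...) on the result of size(). This is a minimal sketch, not the only way to do it. It assumes the frame is called data as in the question, and the sample rows include one extra duplicate (I2, 2014) that is not in the original table, added so that a count of 2 shows up:

```python
import pandas as pd

# Sample rows from the question, plus one assumed duplicate (I2, 2014).
data = pd.DataFrame(
    {
        "Item": ["I1", "I2", "I1", "I2", "I2"],
        "Year": [2015, 2016, 2017, 2014, 2014],
    }
)

top_items = (
    data.groupby(["Year", "Item"])
    .size()                               # Series of counts per (Year, Item)
    .reset_index(name="sales_trend")      # name the count column instead of 0
    .sort_values("sales_trend", ascending=False, kind="stable")
    .reset_index(drop=True)               # tidy 0..n-1 index after sorting
)

print(top_items)
```

Naming the column at the reset_index step avoids wrapping the result in pd.DataFrame(...) and sorting by the integer label 0.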
CodePudding user response:
With datar, a pandas wrapper that reimagines the pandas API, we can translate your R code to Python:
>>> from datar.all import c, f, tibble, select, group_by, summarize, arrange, desc, n
>>>
>>> data = tibble(Item=c("I1", "I2", "I1", "I2", "I2"), Year=c(2015, 2016, 2017, 2014, 2014))
>>> data
Item Year
<object> <int64>
0 I1 2015
1 I2 2016
2 I1 2017
3 I2 2014
4 I2 2014 # add one more item to see if it pops up at the top
>>> top_items = (
... data
... >> select(f.Item, f.Year)
... >> group_by(f.Year, f.Item)
... >> summarize(sales_trend=n())
... >> arrange(desc(f.sales_trend))
... )
[2022-03-17 10:15:54][datar][ INFO] `summarise()` has grouped output by ['Year'] (override with `_groups` argument)
>>> top_items
Year Item sales_trend
<int64> <object> <int64>
0 2014 I2 2
1 2015 I1 1
2 2016 I2 1
3 2017 I1 1
[TibbleGrouped: Year (n=4)]
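The follow-up part of the question (slice_head, mutate, select) is not covered above. Here is a rough plain-pandas sketch of it, under some assumptions: the table is treated as ungrouped, the sample data includes the extra (I2, 2014) row used above, and rank is built from the number of rows actually kept, since the sample has fewer than five:

```python
import pandas as pd

# Same assumed sample data as in the datar session above.
data = pd.DataFrame(
    {
        "Item": ["I1", "I2", "I1", "I2", "I2"],
        "Year": [2015, 2016, 2017, 2014, 2014],
    }
)

# Count per (Year, Item), sorted by count in descending order.
top_items = (
    data.groupby(["Year", "Item"])
    .size()
    .reset_index(name="sales_trend")
    .sort_values("sales_trend", ascending=False, kind="stable")
)

top5 = (
    top_items.head(5)                                  # slice_head(n = 5)
    .assign(
        Year=lambda d: d["Year"].astype(int),          # Year = as.integer(Year)
        rank=lambda d: list(range(1, len(d) + 1)),     # rank = 1:5 (or fewer rows)
    )
    .drop(columns="sales_trend")                       # select(-sales_trend)
    .reset_index(drop=True)
)

print(top5)
```

Using len(d) rather than a hard-coded 1..5 keeps the assignment from failing when fewer than five rows are available.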