Date | Items | Location |
---|---|---|
2022-01-01 | Item1 | Location1 |
2022-01-01 | Item2 | Location1 |
2022-01-02 | Item1 | Location3 |
2022-01-03 | Item3 | Location1 |
2022-02-01 | Item1 | Location2 |
... | ... | ... |
I have a pandas dataframe that's similar to the table above. How do I create a new dataframe with the Items column grouped and used as the index, and the Date column bucketed into months, with each month as a column? The values inside the dataframe should be the number of occurrences of each item in each month.
I'll need to convert the dates to months and change the index, but I'm not sure exactly how to do it.
The layout of the table I want to create is something like the one below:
Items | Jan | Feb |
---|---|---|
Item1 | 5 | 3 |
Item2 | 3 | 2 |
Item3 | 3 | 5 |
... | ... | ... |
CodePudding user response:
Looks like you want a crosstab using the periods of your dates:
out = pd.crosstab(df['Items'], pd.PeriodIndex(df['Date'], freq='M'))
output:
col_0    2022-01  2022-02
Items
Item1          2        1
Item2          1        0
Item3          1        0
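If the Date column is stored as plain strings rather than datetimes, parse it first; the period columns can then be relabelled with month abbreviations to match the desired layout. A minimal sketch, assuming the column names from the question:
import pandas as pd

# assumes df has the 'Date' and 'Items' columns shown in the question
df['Date'] = pd.to_datetime(df['Date'])   # PeriodIndex needs real datetimes

out = pd.crosstab(df['Items'], pd.PeriodIndex(df['Date'], freq='M'))
out.columns = out.columns.strftime('%b')  # 2022-01 -> Jan, 2022-02 -> Feb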
CodePudding user response:
What you want is called a pivot. In order for the months to appear in the correct order (Jan, Feb, Mar, ...) and not alphabetical order (Apr, Aug, Dec, ...), you also need to convert the month names to an ordered categorical type:
# Generate the month names Jan, Feb, Mar, ... and indicate
# that they have an intrinsic order
MonthNameDtype = pd.CategoricalDtype(pd.date_range("2022-01-01", freq="MS", periods=12).strftime("%b"), ordered=True)
# Extract the month name from the dates
df["Month"] = df["Date"].dt.strftime("%b").astype(MonthNameDtype)
# Pivot
df.pivot_table(index="Items", columns="Month", values="Location", aggfunc="count")
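If you would rather see 0 than NaN for the missing item/month combinations, and drop the month categories that never occur in the data, pivot_table accepts fill_value and observed; a sketch under the same assumptions as the snippet above:
# same Month column as above; fill_value gives 0 instead of NaN,
# observed=True drops month categories that never occur in the data
df.pivot_table(
    index="Items", columns="Month", values="Location",
    aggfunc="count", fill_value=0, observed=True,
)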
CodePudding user response:
Add a month column to your dataframe:
df['month'] = df['Date'].dt.strftime('%b')
Then the following:
res = (
    df.groupby(["Items", "month"])
      .size()
      .reset_index(name="count")
      .pivot(columns="month", index="Items", values="count")
)
should output the desired result:
month  Feb  Jan
Items
Item1  1.0  2.0
Item2  NaN  1.0
Item3  NaN  1.0
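An alternative that keeps integer counts and fills the missing item/month combinations with 0 is to unstack the size result instead of pivoting the reset frame; a sketch assuming the same 'month' column:
res = (
    df.groupby(["Items", "month"])
      .size()
      .unstack("month", fill_value=0)  # months become columns, missing counts -> 0
)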
CodePudding user response:
Using the link from the OP's GitHub, a complete solution using Grouper and pivot:
df = pd.read_csv(
    "https://raw.githubusercontent.com/yongkheehou/recyclegowheredata/main/random_generated_website_usage_data.csv",
    parse_dates=["Date"],
    index_col="Date",
)

g = (
    df.groupby([pd.Grouper(freq="M"), "Item"])
      .count()
      .reset_index()
      .pivot(columns="Date", index="Item", values="Bin Location")
      .fillna(0)
      .astype(int)
)
It seems right:
>>> g.iloc[:5, :5]
Date              2021-01-31  2021-02-28  2021-03-31  2021-04-30  2021-05-31
Item
Aerosol Can               11          14          10          11          15
Air-Conditioners           1           2           0           0           2
Backpacks                  0           0           1           0           0
Baking Glassware           0           0           0           0           0
Bedding                    0           0           2           2           1
This answer is essentially the same as @mozway's shorter and more elegant one (https://stackoverflow.com/a/73473157/837451), except that groupby.count does not count NaNs. Since "Bin Location" in the original data contains NaNs, the two results are not identical; counting an always non-NaN column such as "s/n" gives identical results. Which answer is correct depends on your use case.
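For illustration, a toy example of the size vs. count distinction described above (the data below is made up, not taken from the linked CSV):
import pandas as pd
import numpy as np

# one row has a missing Bin Location
toy = pd.DataFrame({
    "Item": ["A", "A", "B"],
    "Bin Location": ["x", np.nan, "y"],
})

toy.groupby("Item").size()                   # A -> 2, B -> 1 (counts every row)
toy.groupby("Item")["Bin Location"].count()  # A -> 1, B -> 1 (skips the NaN)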