An extract from my df (35k rows in total):
stop_id   time
7909      2022-04-06T03:47:00+03:00
7909      2022-04-06T04:07:00+03:00
1009413   2022-04-06T04:10:00+03:00
1002246   2022-04-06T04:19:00+03:00
1009896   2022-04-06T04:20:00+03:00
I want to create a separate dataframe for every unique stop_id. Each dataframe should contain the stop_id column (constant within each dataframe), a time column with only unique values, and a number column counting how many rows share the same time value. So, assuming there are 50 unique stop_id values, I want to end up with 50 separate CSV files covering all the data above. How can I do this?
I hope the explanation isn't too messy.
I have this line of code:
df.groupby(['time']).agg({'time': 'size', 'stop_id': ", ".join})
but it doesn't keep the value of stop_id.
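To illustrate the problem, here is the same aggregation rewritten with named aggregation on made-up sample data (the column names come from the extract above; the data is mine). Grouping by time alone merges stop_id values from different stops into a single row, which is why the per-stop information is lost:

```python
import pandas as pd

# Made-up sample: two different stops share the 03:47 time
df = pd.DataFrame({
    "stop_id": ["7909", "7909", "1009413"],
    "time": ["03:47", "04:07", "03:47"],
})

# Equivalent of the agg above, written as named aggregation
out = df.groupby("time").agg(
    number=("stop_id", "size"),
    stop_id=("stop_id", ", ".join),
)
# The 03:47 row now holds stop_id "7909, 1009413" -- both stops merged together
```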
expected output: csv1
stop_id   time                        number
7909      2022-04-06T03:47:00+03:00   1
7909      2022-04-06T04:07:00+03:00   1
7909      2022-04-06T05:00:00+03:00   2
...
csv2
stop_id   time                        number
1009413   2022-04-06T04:10:00+03:00   1
1009413   2022-04-06T04:19:00+03:00   3
1009413   2022-04-06T04:30:00+03:00   5
...
CodePudding user response:
You can use groupby on both stop_id and time, with the size() aggregation, to get what will become the number column in each dataframe. After that, you can collect the unique stop_id values and iterate over each group to build the individual dataframes as follows:
import pandas as pd

data = {"stop_id": [...], "time": [...]}  # your data
df = pd.DataFrame(data=data)  # create the DataFrame from the data

# The grouped result is a Series with a MultiIndex of the form (stop_id, time)
g = df.groupby(['stop_id', 'time']).size()

# Set of stop_ids; you could also use df.stop_id.unique()
stops = {i[0] for i in g.index}

# Iterate over every unique stop_id
for stop in stops:
    # Keep only the index entries with the matching stop_id
    times = filter(lambda x: x[0] == stop, g.index)
    # Prepare the new DataFrame
    data = {"stop_id": [], "time": [], "number": []}
    # Iterate over each unique time for this stop_id
    for time in times:
        data["stop_id"].append(stop)               # add the stop_id
        data["time"].append(time[1])               # add the current time
        data["number"].append(g[(stop, time[1])])  # add its count
    # Save the DataFrame as a CSV
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
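As an aside, the same result can be reached without filtering the MultiIndex by hand: reset_index turns the counts back into ordinary columns, and a second groupby yields one sub-frame per stop. A minimal sketch with made-up sample data (column names taken from the question):

```python
import pandas as pd

# Made-up sample data following the question's layout
df = pd.DataFrame({
    "stop_id": [7909, 7909, 1009413, 1009413, 1009413],
    "time": ["03:47", "04:07", "04:10", "04:10", "04:19"],
})

# Count rows per (stop_id, time) and flatten the MultiIndex into columns
counts = df.groupby(["stop_id", "time"]).size().reset_index(name="number")

# Iterating over a groupby yields one sub-frame per stop_id, ready to write out
for stop, sub in counts.groupby("stop_id"):
    sub.to_csv(f"{stop}.csv", index=False)
```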
Edit to address comment
If I understood correctly, instead of number you now want a list of the elements that the previous script only counted. This is possible thanks to the apply() method, used in the following way:
import pandas as pd

data = {"stop_id": [...], "route_name": [...], "time": [...]}
df = pd.DataFrame(data=data)

# The grouped result is a Series indexed by the tuple (stop_id, time);
# apply the list() function over the values of "route_name" in each group
g = df.groupby(['stop_id', 'time'])["route_name"].apply(list)
print(g)

# Set of stop_ids
stops = {i[0] for i in g.index}

for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = {"stop_id": [], "time": [], "route_names": []}
    for time in times:
        data["stop_id"].append(stop)
        data["time"].append(time[1])
        data["route_names"].append(g[(stop, time[1])])
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
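This variant, too, can be shortened with reset_index plus a second groupby. A sketch on made-up sample data (the route_name values are illustrative):

```python
import pandas as pd

# Made-up sample data; route_name values are illustrative
df = pd.DataFrame({
    "stop_id": [7909, 7909, 7909],
    "route_name": ["A", "B", "C"],
    "time": ["03:47", "03:47", "04:07"],
})

# Collect the route names per (stop_id, time), then flatten the index
g = (df.groupby(["stop_id", "time"])["route_name"]
       .apply(list)
       .reset_index(name="route_names"))

# One CSV per stop_id
for stop, sub in g.groupby("stop_id"):
    sub.to_csv(f"{stop}.csv", index=False)
```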
Pay attention to the fact that, written this way, if you read one of the produced CSVs back into a pandas DataFrame, you have to convert the route_names field from a string back into a list; ast.literal_eval from the standard library is one common approach.
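A minimal round-trip sketch of that conversion (the file name and contents are illustrative, simulating a CSV produced by the loop above):

```python
import ast
import pandas as pd

# Simulate a CSV produced by the loop above (contents are illustrative)
df = pd.DataFrame({
    "stop_id": [7909],
    "time": ["2022-04-06T03:47:00"],
    "route_names": [["A", "B"]],
})
df.to_csv("7909.csv", index=False)

# Reading it back: route_names comes in as the string "['A', 'B']"
df2 = pd.read_csv("7909.csv")
df2["route_names"] = df2["route_names"].apply(ast.literal_eval)
```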