Create and save DataFrames from another DataFrame


An extract from my df (35k rows in total):

stop_id                      time
7909       2022-04-06T03:47:00+03:00
7909       2022-04-06T04:07:00+03:00
1009413    2022-04-06T04:10:00+03:00
1002246    2022-04-06T04:19:00+03:00
1009896    2022-04-06T04:20:00+03:00

I want to create a separate dataframe for every unique stop_id value. In each dataframe I need the stop_id (constant across all its rows), the time field with only unique values, and a number column that counts the rows sharing the same time value. So, assuming there are 50 unique stop_id values, I want to end up with 50 separate CSV files containing all the data above. How can I do this?

I hope the explanation isn't too messy.

I have this line of code:

df.groupby(['time']).agg({'time':'size','stop_id': ", ".join})

but it doesn't keep the stop_id values.

Expected output: csv1

stop_id   time                        number
7909      2022-04-06T03:47:00+03:00   1
7909      2022-04-06T04:07:00+03:00   1
7909      2022-04-06T05:00:00+03:00   2
...

csv2

stop_id      time                        number
1009413      2022-04-06T04:10:00+03:00   1
1009413      2022-04-06T04:19:00+03:00   3
1009413      2022-04-06T04:30:00+03:00   5
...

CodePudding user response:

You can use groupby on both stop_id and time with the size() aggregation to get what will become the number column of each DataFrame. After that, you can collect all the unique stop_id values and iterate over them to build and save each individual DataFrame, as follows:

import pandas as pd


data = {"stop_id": [...], "time": [...]} # Your Data

df = pd.DataFrame(data=data) # Create the DataFrame from the data

# groupby(...).size() returns a Series with a MultiIndex of (stop_id, time)
g = df.groupby(['stop_id', 'time']).size()

# Set of stop_ids, you can also use df.stop_id.unique()
stops = { i[0] for i in g.index }

# Iterate over every unique stop_id
for stop in stops:
    # Filter only the groups with the right stop_id
    times = filter(lambda x: x[0] == stop, g.index)
    
    # Prepare new DataFrame
    data = { "stop_id": [], "time": [], "number": []}
    
    # Iterate over each unique time for the specific stop_id
    for time in times:
        data["stop_id"].append(stop) # add the stop_id
        data["time"].append(time[1]) # add the current time
        data["number"].append(g[(stop, time[1])]) # add its count

    # Save the DataFrame as a CSV
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
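
For what it's worth, the same result can be obtained more compactly by resetting the MultiIndex and letting a second groupby do the per-stop split. A minimal sketch using the sample rows from the question (the "{stop}.csv" naming simply follows the script above):

import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({
    "stop_id": [7909, 7909, 1009413, 1002246, 1009896],
    "time": ["2022-04-06T03:47:00+03:00", "2022-04-06T04:07:00+03:00",
             "2022-04-06T04:10:00+03:00", "2022-04-06T04:19:00+03:00",
             "2022-04-06T04:20:00+03:00"],
})

# Count rows per (stop_id, time); reset_index turns the MultiIndex into columns
counts = df.groupby(["stop_id", "time"]).size().reset_index(name="number")

# Iterating a groupby yields (key, sub-DataFrame) pairs, one per stop_id
for stop, sub in counts.groupby("stop_id"):
    sub.to_csv(f"{stop}.csv", index=False)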

Edit to address comment

If I understood correctly, instead of the number column you now want a list of the elements that were being counted in the previous script. This is possible thanks to the apply() method, used in the following way:

import pandas as pd


data = {"stop_id": [...], "route_name": [...], "time": [...]}

df = pd.DataFrame(data=data)

# groupby(...)["route_name"].apply(list) returns a Series with a MultiIndex
# of (stop_id, time); each value is the list of route_name entries in that group
g = df.groupby(['stop_id', 'time'])["route_name"].apply(list)
print(g)

# Set of stop_ids, as before
stops = { i[0] for i in g.index }
for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = { "stop_id": [], "time": [], "route_names": []}
    for time in times:
        data["stop_id"].append(stop)
        data["time"].append(time[1])
        data["route_names"].append(g[(stop, time[1])])
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)

Pay attention to the fact that if you later read one of the produced CSVs back into a pandas DataFrame, the route_names field will come back as a string rather than a list, so you will have to convert it (for example with ast.literal_eval).
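
A minimal sketch of that conversion, assuming the CSV was written by the script above (the file name "7909.csv" is just one of the hypothetical per-stop outputs):

import ast

import pandas as pd

# Read back one of the per-stop CSVs produced above
df = pd.read_csv("7909.csv")

# route_names was serialized as the string representation of a list,
# e.g. "['12', '34']"; ast.literal_eval safely parses it back into a list
df["route_names"] = df["route_names"].apply(ast.literal_eval)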
