How to split dataframe column by row-CodePudding

I have dataset which contains 6.2M records. When iam splitting it by grouping, it looses about 1.2M records. This is fraction of data set:

VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
1        | 2020-01-01 00:28:15  | 2020-01-01 00:33:03   |  1.0  
1        | 2020-01-01 00:35:39  | 2020-01-01 00:43:04   |  1.0  
..       |....................  | ...................   |  ...
1        | 2020-01-31 00:47:41  | 2020-01-31 00:53:52   |  1.0  
1        | 2020-01-31 00:55:23  | 2020-01-31 01:00:14   |  1.0  
2        | 2020-01-31 00:01:58  | 2020-01-31 00:04:16   |  1.0

I need to split it in column tpep_dropoff_datetime by days. This is code i use to do that, but as i mentioned before, it's not working properly.

for date, g in df.groupby(pd.to_datetime(df['tpep_dropoff_datetime']).dt.normalize().astype(str)):
    g.to_csv(f'{date}.csv', index=False)

Any ideas, how to split dataframe ?

CodePudding user response：

you could try this, though I believe it might not be the best way around (Pandas might have a better way for doing this).

import pandas as pd
cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [[1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
     [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
     [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
     [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
    ],
    columns=cols,
)
# I do this because of the example the date is a string and I'm changing it to datetime.
# This might not be necesary, depends on your data.
df["tpep_dropoff_datetime"] = pd.to_datetime(df['tpep_dropoff_datetime'], format="%Y-%m-%d %H:%M:%S")
# Create a new column named "my_date" which
# will contains the date from the column "tpep_dropoff_datetime"
df["my_date"] = df["tpep_dropoff_datetime"].dt.date

# Now we group by date al the rows, and copy the ones according to their index
for date, indexes in df.groupby('my_date').groups.items():
    print(f"date: {date}")
    print(f"indexes: {indexes}")
    # Copying the rows I want according to the index
    aux_df = df.loc[indexes]
    print(aux_df)
    # Exporting to csv only the columns I want
    aux_df.to_csv(f"{date}.csv", columns=cols, index=False)

The output are the files and this one in the console:

date: 2020-01-01
indexes: Int64Index([0, 1], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
0         1  2020-01-01 00:28:15   2020-01-01 00:33:03              1.0  2020-01-01
1         1  2020-01-01 00:35:39   2020-01-01 00:43:04              1.0  2020-01-01
date: 2020-01-31
indexes: Int64Index([2, 3], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
2         1  2020-01-31 00:47:41   2020-01-31 00:53:52              1.0  2020-01-31
3         1  2020-01-31 00:55:23   2020-01-31 01:00:14              1.0  2020-01-31

With this at least I would be sure that I get the date right, but probably is not the best in efficiency