I have dataset which contains 6.2M records. When iam splitting it by grouping, it looses about 1.2M records. This is fraction of data set:
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
1 | 2020-01-01 00:28:15 | 2020-01-01 00:33:03 | 1.0
1 | 2020-01-01 00:35:39 | 2020-01-01 00:43:04 | 1.0
.. |.................... | ................... | ...
1 | 2020-01-31 00:47:41 | 2020-01-31 00:53:52 | 1.0
1 | 2020-01-31 00:55:23 | 2020-01-31 01:00:14 | 1.0
2 | 2020-01-31 00:01:58 | 2020-01-31 00:04:16 | 1.0
I need to split it in column tpep_dropoff_datetime
by days. This is code i use to do that, but as i mentioned before, it's not working properly.
for date, g in df.groupby(pd.to_datetime(df['tpep_dropoff_datetime']).dt.normalize().astype(str)):
g.to_csv(f'{date}.csv', index=False)
Any ideas, how to split dataframe ?
CodePudding user response:
you could try this, though I believe it might not be the best way around (Pandas might have a better way for doing this).
import pandas as pd
cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
[[1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
[1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
[1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
[1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
],
columns=cols,
)
# I do this because of the example the date is a string and I'm changing it to datetime.
# This might not be necesary, depends on your data.
df["tpep_dropoff_datetime"] = pd.to_datetime(df['tpep_dropoff_datetime'], format="%Y-%m-%d %H:%M:%S")
# Create a new column named "my_date" which
# will contains the date from the column "tpep_dropoff_datetime"
df["my_date"] = df["tpep_dropoff_datetime"].dt.date
# Now we group by date al the rows, and copy the ones according to their index
for date, indexes in df.groupby('my_date').groups.items():
print(f"date: {date}")
print(f"indexes: {indexes}")
# Copying the rows I want according to the index
aux_df = df.loc[indexes]
print(aux_df)
# Exporting to csv only the columns I want
aux_df.to_csv(f"{date}.csv", columns=cols, index=False)
The output are the files and this one in the console:
date: 2020-01-01
indexes: Int64Index([0, 1], dtype='int64')
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count my_date
0 1 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 2020-01-01
1 1 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 2020-01-01
date: 2020-01-31
indexes: Int64Index([2, 3], dtype='int64')
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count my_date
2 1 2020-01-31 00:47:41 2020-01-31 00:53:52 1.0 2020-01-31
3 1 2020-01-31 00:55:23 2020-01-31 01:00:14 1.0 2020-01-31
With this at least I would be sure that I get the date right, but probably is not the best in efficiency