Following my previous question on dynamically creating sub-folders and writing files to them, I realised I needed further help after applying the solution to my real dataset.
Suppose this is the dataframe that I have:
data = {'user': [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 11, 11, 11],
'session_id': [15, 15, 15, 15, 31, 31, 31, 20, 20,
20, 43, 43, 43],
'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41','2016-04-13 07:58:42',
'2016-04-13 07:58:43', '2016-04-01 20:29:37', '2016-04-01 20:29:42',
'2016-04-01 20:29:47', '2016-04-13 21:04:27', '2016-04-13 21:04:28',
'2016-04-13 21:04:29', '2016-03-30 06:21:59', '2016-03-30 06:22:04',
'2016-03-30 06:22:09'],
'lat': [41.1872084, 41.1870716, 41.1869719, 41.1868664, 41.1471521, 41.1472466,
41.1473038, 40.0121007, 40.0121084, 40.0121534, 41.2372125, 41.2371444, 41.2369725],
'lon': [-8.6038931, -8.6037318, -8.6036908, -8.6036423, -8.5878757, -8.5874314, -8.586632,
-8.5992162, -8.5992024, -8.5991788, -8.6720773, -8.6721269, -8.6718833]}
df = pd.DataFrame(data)
df
user session_id logtime lat lon
0 7 15 2016-04-13 07:58:40 41.187208 -8.603893
1 7 15 2016-04-13 07:58:41 41.187072 -8.603732
2 7 15 2016-04-13 07:58:42 41.186972 -8.603691
3 7 15 2016-04-13 07:58:43 41.186866 -8.603642
4 7 31 2016-04-01 20:29:37 41.147152 -8.587876
5 7 31 2016-04-01 20:29:42 41.147247 -8.587431
6 7 31 2016-04-01 20:29:47 41.147304 -8.586632
7 7 20 2016-04-13 21:04:27 40.012101 -8.599216
8 7 20 2016-04-13 21:04:28 40.012108 -8.599202
9 7 20 2016-04-13 21:04:29 40.012153 -8.599179
10 11 43 2016-03-30 06:21:59 41.237212 -8.672077
11 11 43 2016-03-30 06:22:04 41.237144 -8.672127
12 11 43 2016-03-30 06:22:09 41.236973 -8.671883
I am re-organising this dataframe by creating a sub-folder for each user, and then creating CSV files in that user's sub-folder. Each file should contain the user's session logs (logtime, lat, lon). However, I want to write all sessions a user covered on the same day into a single file, e.g. file1.csv.
This answer to my previous question is fine:
import os
import pandas as pd
df = pd.read_csv('mydata.csv', parse_dates=['logtime'])
# make base dir
base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)
for user_id, user_data in df.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    for file_id, (sess_id, data) in enumerate(user_data.groupby('session_id')):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)
This creates one file for each session of a user, resulting in the directory structure:
$ tree Data/
Data/
├── 11
│ └── file1.csv
└── 7
├── file1.csv
├── file2.csv
└── file3.csv
2 directories, 4 files
File contents:
$ cat Data/7/file1.csv
logtime,lat,lon
2016-04-13 07:58:40,41.1872084,-8.6038931
2016-04-13 07:58:41,41.1870716,-8.6037318
2016-04-13 07:58:42,41.1869719,-8.6036908
2016-04-13 07:58:43,41.1868664,-8.6036423
$ cat Data/7/file2.csv
logtime,lat,lon
2016-04-13 21:04:27,40.0121007,-8.5992162
2016-04-13 21:04:28,40.0121084,-8.5992024
2016-04-13 21:04:29,40.0121534,-8.5991788
$ cat Data/7/file3.csv
logtime,lat,lon
2016-04-01 20:29:37,41.1471521,-8.5878757
2016-04-01 20:29:42,41.1472466,-8.5874314
2016-04-01 20:29:47,41.1473038,-8.586632
Since sessions 15 and 20 of user 7 were covered on the same day, I would like to combine these into file1.csv, maintaining the time order (logs of session 15, then session 20).
To do this, I modified the code above to:
for ...
    # now group by 'session_id' and 'logtime'
    for file_id, (sess_id, data) in enumerate(user_data.groupby(['session_id', 'logtime'])):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)
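The root of the problem can be seen with a minimal sketch (using a made-up three-row frame, not the full dataset): since each row carries a unique logtime, grouping on ['session_id', 'logtime'] produces one group per row, and therefore one file per row.

```python
import pandas as pd

# Each row has a unique (session_id, logtime) pair, so grouping on both
# columns splits every row into its own group.
df = pd.DataFrame({
    'session_id': [15, 15, 20],
    'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41', '2016-04-13 21:04:27'],
})
groups = df.groupby(['session_id', 'logtime'])
print(groups.ngroups)  # 3 groups for 3 rows -> one file per row
```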
Giving:
$ tree Data/
Data/
├── 11
│ ├── file1.csv
│ ├── file2.csv
│ └── file3.csv
└── 7
├── file10.csv
├── file1.csv
├── file2.csv
├── file3.csv
├── file4.csv
├── file5.csv
├── file6.csv
├── file7.csv
├── file8.csv
└── file9.csv
2 directories, 13 files
This creates one file for each row of user 7.
Required:
$ tree Data/
Data/
├── 11
│ └── file1.csv
| | 2016-03-30 06:21:59,41.237212,-8.672077
| | 2016-03-30 06:22:04,41.237144,-8.672127
| | 2016-03-30 06:22:09,41.236973,-8.671883
└── 7
├── file1.csv
| 2016-04-13 07:58:40,41.187208,-8.603893
| 2016-04-13 07:58:41,41.187072,-8.603732
| 2016-04-13 07:58:42,41.186972,-8.603691
| 2016-04-13 07:58:43,41.186866,-8.603642
| 2016-04-13 21:04:27,40.012101,-8.599216
| 2016-04-13 21:04:28,40.012108,-8.599202
| 2016-04-13 21:04:29,40.012153,-8.599179
└── file2.csv
2016-04-01 20:29:37,41.147152,-8.587876
2016-04-01 20:29:42,41.147247,-8.587431
2016-04-01 20:29:47,41.147304,-8.586632
2 directories, 3 files
CodePudding user response:
This should solve your problem. You can't combine two sessions from the same day because you group by session_id, so each iteration can only handle one session. Since you are only interested in sessions that took place on the same day, group by the day in the second step instead.
base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)
for user_id, user_data in df.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    # group by calendar day ('logtime' must be datetime, e.g. via parse_dates=['logtime'])
    for file_id, (day, data) in enumerate(user_data.groupby(user_data['logtime'].dt.date)):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)
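One caveat: the loop above writes each day's rows in their original input order. If you want to guarantee the time order within each daily file regardless of how the input is ordered, you can sort by logtime before writing. A self-contained sketch, using a deliberately shuffled slice of the sample data above:

```python
import os
import pandas as pd

# A shuffled slice of the sample data: session 20's rows come before
# session 15's, so the raw input order is not chronological.
df = pd.DataFrame({
    'user': [7, 7, 7, 7],
    'session_id': [20, 20, 15, 15],
    'logtime': pd.to_datetime(['2016-04-13 21:04:27', '2016-04-13 21:04:28',
                               '2016-04-13 07:58:40', '2016-04-01 20:29:37']),
    'lat': [40.0121007, 40.0121084, 41.1872084, 41.1471521],
    'lon': [-8.5992162, -8.5992024, -8.6038931, -8.5878757],
})

base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)
for user_id, user_data in df.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    for file_id, (day, data) in enumerate(user_data.groupby(user_data['logtime'].dt.date)):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        # sort each day's rows chronologically before writing
        (data.sort_values('logtime')
             .drop(['user', 'session_id'], axis=1)
             .to_csv(filename, index=False))
```

With this slice, user 7 gets file1.csv for 2016-04-01 and file2.csv for 2016-04-13, and the 2016-04-13 file lists the 07:58 log before the 21:04 logs even though they arrived last in the input.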