I have this dataframe:
import pandas as pd

data = {'user': [7, 7, 7, 7, 7, 7, 7, 11, 11, 11],
        'session_id': [15, 15, 15, 15, 31, 31, 31, 43, 43, 43],
        'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41', '2016-04-13 07:58:42',
                    '2016-04-13 07:58:43', '2016-04-01 20:29:37', '2016-04-01 20:29:42',
                    '2016-04-01 20:29:47', '2016-03-30 06:21:59', '2016-03-30 06:22:04',
                    '2016-03-30 06:22:09'],
        'lat': [41.1872084, 41.1870716, 41.1869719, 41.1868664, 41.1471521,
                41.1472466, 41.1473038, 41.2372125, 41.2371444, 41.2369725],
        'lon': [-8.6038931, -8.6037318, -8.6036908, -8.6036423, -8.5878757,
                -8.5874314, -8.586632, -8.6720773, -8.6721269, -8.6718833]}
d = pd.DataFrame(data)
d
   user  session_id              logtime        lat       lon
0     7          15  2016-04-13 07:58:40  41.187208 -8.603893
1     7          15  2016-04-13 07:58:41  41.187072 -8.603732
2     7          15  2016-04-13 07:58:42  41.186972 -8.603691
3     7          15  2016-04-13 07:58:43  41.186866 -8.603642
4     7          31  2016-04-01 20:29:37  41.147152 -8.587876
5     7          31  2016-04-01 20:29:42  41.147247 -8.587431
6     7          31  2016-04-01 20:29:47  41.147304 -8.586632
7    11          43  2016-03-30 06:21:59  41.237212 -8.672077
8    11          43  2016-03-30 06:22:04  41.237144 -8.672127
9    11          43  2016-03-30 06:22:09  41.236973 -8.671883
I want to:
1. Create a sub-directory (in the current working directory) for each user.
2. Within each user's sub-directory, create one CSV file for each session of that user.
3. Write each session's logtime, lat and lon (without the session ID) to its file, naming the files file1.csv, file2.csv, etc.
4. Move on to the next user, until all users are done.
Expected output
The final directory structure and file contents should look like this (file contents shown inline):
Data/
├── 11
│   └── file1.csv
│           logtime,lat,lon
│           2016-03-30 06:21:59,41.2372125,-8.6720773
│           2016-03-30 06:22:04,41.2371444,-8.6721269
│           2016-03-30 06:22:09,41.2369725,-8.6718833
└── 7
    ├── file1.csv
    │       logtime,lat,lon
    │       2016-04-13 07:58:40,41.187208,-8.603893
    │       2016-04-13 07:58:41,41.187072,-8.603732
    │       2016-04-13 07:58:42,41.186972,-8.603691
    │       2016-04-13 07:58:43,41.186866,-8.603642
    └── file2.csv
            logtime,lat,lon
            2016-04-01 20:29:37,41.147152,-8.587876
            2016-04-01 20:29:42,41.147247,-8.587431
            2016-04-01 20:29:47,41.147304,-8.586632
CodePudding user response:
This can be done with os.makedirs and groupby:
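import os

# Make the data folder if needed; change the path if needed
# (e.g. 'Data' for a folder inside the current working directory)
base_folder = '/Data'
os.makedirs(base_folder, exist_ok=True)

# One group per (user, session) pair; d is the dataframe from the question
for (user_id, sess_id), data in d.groupby(['user', 'session_id']):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    filename = f'{user_folder}/file_{sess_id}.csv'
    # Drop the grouping columns so only logtime, lat, lon are written
    data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)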
Note that this names each file after its session_id. If you want the files numbered per user as in your question (file1.csv, file2.csv, etc.), you can use two nested groupby loops, something like this:
for user_id, user_data in d.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    # enumerate numbers the sessions 1, 2, ... within each user
    for file_id, (sess_id, data) in enumerate(user_data.groupby('session_id'), start=1):
        filename = f'{user_folder}/file{file_id}.csv'
        ....
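For reference, the nested-groupby idea can be written out in full roughly as below. This is a minimal sketch, not necessarily what the answer's elided part contained: the helper name write_sessions, its base_folder default of 'Data', and the use of pathlib instead of os.makedirs are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def write_sessions(df, base_folder="Data"):
    """Write one CSV per session into one sub-directory per user.

    Hypothetical helper: the name and the 'Data' default are illustrative.
    Returns the list of paths that were written.
    """
    written = []
    for user_id, user_data in df.groupby("user"):
        user_folder = Path(base_folder) / str(user_id)
        user_folder.mkdir(parents=True, exist_ok=True)
        # groupby sorts its keys, so files are numbered by ascending session_id
        sessions = user_data.groupby("session_id")
        for file_id, (_, session) in enumerate(sessions, start=1):
            path = user_folder / f"file{file_id}.csv"
            # Keep only logtime, lat, lon by dropping the grouping columns
            session.drop(columns=["user", "session_id"]).to_csv(path, index=False)
            written.append(path)
    return written
```

Calling write_sessions(d) with the dataframe from the question should create Data/7/file1.csv, Data/7/file2.csv and Data/11/file1.csv relative to the current working directory. Because the numbering follows sorted session_id values, session 15 becomes file1.csv and session 31 becomes file2.csv for user 7.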
CodePudding user response:
Another possible solution:
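import os

# Create folders, assuming the current working directory as root
for folder in d['user'].unique():
    os.makedirs(str(folder), exist_ok=True)

# Number the sessions 1, 2, ... within each user, then write one CSV per session.
# group_keys=False keeps 'user' as a plain column only; newer pandas versions
# would otherwise also add it as an index level and make the second groupby ambiguous.
(d.groupby('user', group_keys=False)
  .apply(lambda x: x.assign(id=x.groupby('session_id').ngroup() + 1))
  .groupby(['user', 'session_id'])
  .apply(lambda y: y.iloc[:, 2:(len(y.columns) - 1)]
         .to_csv(os.path.join(
             os.getcwd(), str(y['user'].unique()[0]),
             f'file{y.id.unique()[0]}.csv'), index=False)))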
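Whichever approach is used, it may help to check the result against the expected layout. The following is a small sketch for that purpose; summarize_csv_tree is a hypothetical helper name, not part of either answer, and it assumes every .csv file under the given root was produced by the code above.

```python
from pathlib import Path

import pandas as pd


def summarize_csv_tree(root="."):
    """Map each CSV path (relative to root) to (column names, row count).

    Hypothetical helper for checking the written files.
    """
    root = Path(root)
    summary = {}
    # rglob walks all sub-directories, so every user's folder is covered
    for path in sorted(root.rglob("*.csv")):
        frame = pd.read_csv(path)
        key = path.relative_to(root).as_posix()
        summary[key] = (list(frame.columns), len(frame))
    return summary
```

For the data in the question, each entry should list the columns logtime, lat and lon, with 4 and 3 rows for user 7's two files and 3 rows for user 11's single file.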