Python large text processing

Time:09-27

I'm a programming beginner looking for guidance from the experts here. Thank you very much!
I have a set of taxi GPS data stored in 25 text files of about 2 GB each. Each row is one snapshot, in this format:
1,30.624806,104.136604,1,2014/8/3 21:18:46
From left to right: the taxi number (about 12,000 taxis in total), the current longitude, the current latitude, whether the taxi is currently carrying a passenger (1 = occupied, 0 = empty), and the current time.
What I want is to convert this GPS data into passenger trip trajectories, one line per passenger trip, like this:
1,2014/8/3 21:18:46,30.624806,104.136604,30.702500,104.072532,30.702510,104.072560,30.702534,104.072572
From left to right: the taxi number, the start time of the trip, then the GPS positions recorded from the start of the trip to its end.
The text files are sorted by taxi number, but each taxi's snapshots are not in chronological order. So my plan is: 1. sort the snapshots on two keys, first taxi number and then time; 2. find some way to turn the sorted data into trajectory data (I have no idea how to do this yet).
The problem is that the data volume is too large and my code simply gets nowhere. I have tried several approaches without success. Could the experts point me in the right direction? Thanks!

# In [1]
import datetime
now_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print("start_time: " + str(now_time))
# In [ ]
file = r"F:\didi\20140803_train.txt"
with open(file) as f:
    dic = []
    # readlines() loads the whole 2 GB file into memory at once
    for line in f.readlines():
        line = line.strip('\n')
        dic.append(line.split(","))
now_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print("time_1: " + str(now_time))
# In [2]
# Split the timestamp column into a date column and a time column
for i in range(0, len(dic)):
    dic[i].append(dic[i][4].split(" ")[0])
    dic[i].append(dic[i][4].split(" ")[1])
    dic[i].remove(dic[i][4])
now_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print("time_2: " + str(now_time))
# In [3]
# Split the time column into three columns: hour, minute, second
for i in range(0, len(dic)):
    dic[i].append(dic[i][5].split(":")[0])
    dic[i].append(dic[i][5].split(":")[1])
    dic[i].append(dic[i][5].split(":")[2])
    dic[i].remove(dic[i][5])
now_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print("time_3: " + str(now_time))
# In [4]
# Move taxi number, hour, minute, second, and date into the first five columns so the rows are easy to sort
# Each row is now [taxi, coordinate, coordinate, passenger, date, hour, minute, second]
for i in range(0, len(dic)):
    dic[i] = dic[i][0:1] + dic[i][5:8] + dic[i][4:5] + dic[i][1:4]
dic = sorted(dic)
now_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print("end_time: " + str(now_time))
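As a side note on step 1 of the question (sorting by taxi number and then time), here is a minimal sketch that sorts with a key function instead of splitting the timestamp into separate columns. It assumes each row looks like `taxi,coord,coord,passenger,YYYY/M/D H:MM:SS` and that taxi numbers are integers; `parse_row` and `sort_rows` are hypothetical names, not part of the original post. Parsing the timestamp also avoids a pitfall of sorting the raw strings, where an hour of "9" would sort after "21".

```python
from datetime import datetime

def parse_row(line):
    """Split one CSV row into (taxi_id, c1, c2, passenger, timestamp)."""
    taxi_id, c1, c2, passenger, stamp = [p.strip() for p in line.split(",")]
    # Assumed timestamp layout, e.g. 2014/8/3 21:18:46
    when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return int(taxi_id), float(c1), float(c2), int(passenger), when

def sort_rows(lines):
    """Parse non-empty lines and sort them by taxi number, then by time."""
    rows = [parse_row(line) for line in lines if line.strip()]
    rows.sort(key=lambda r: (r[0], r[4]))
    return rows
```

This still holds all parsed rows in memory, so it only suits a slice of the data (for example, the rows of a few hundred taxis), not a whole 2 GB file.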





CodePudding user response:

Experts, please help!

CodePudding user response:

Set up two databases: an old one and a new one.
The old database comes first: write the text files into it line by line.
Then extract the data for each individual vehicle from the old database and write it into the new database; expect about 12,000 tables.
Hopefully the contents of your text files are already sorted by time.
If not, the data in the new database needs to be sorted by time.
Then extract the rows for a car where the passenger state is 1; those coordinates are what you are after.

CodePudding user response:

Quoting the 2nd-floor reply from "day I":
Set up two databases: an old one and a new one.
The old database comes first: write the text files into it line by line.
Then extract the data for each individual vehicle from the old database and write it into the new database; expect about 12,000 tables.
Hopefully the contents of your text files are already sorted by time.
If not, the data in the new database needs to be sorted by time.
Then extract the rows for a car where the passenger state is 1; those coordinates are what you are after.

Each text file holds one specific day of GPS data for the 12,000 taxis. The rows are arranged by taxi number, but each taxi's GPS records are not in time order.
Also, with the approach you describe, the individual passenger trips still are not separated from one another, are they?

CodePudding user response:

Even simpler:
Put everything into the database, with one table per date.
That gives 25 tables.
Extract the data for one car on one day; the amount of data should be small.
Traverse that data. When the passenger flag is 1, keep taking the following rows for as long as the flag stays 1. When it drops to 0, write the extracted rows to a new TXT file, delete them from the original list, and restart the traversal, until the original list is empty.
When the traversal is finished, the TXT file contains several lines, each made up entirely of rows where the passenger flag is 1.
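The traversal described above can also be done in a single pass, without deleting rows and restarting. The following is only a sketch of that idea using `itertools.groupby`, not the replier's exact procedure. It assumes the rows for one day are already parsed into tuples of `(taxi_id, c1, c2, passenger, timestamp)` and sorted by taxi number and time; `extract_trips` is a hypothetical name.

```python
from itertools import groupby

def extract_trips(rows):
    """Collapse consecutive passenger == 1 rows of one taxi into one trip.

    rows must already be sorted by (taxi_id, timestamp). Each trip is a
    list: [taxi_id, start_time, c1, c2, c1, c2, ...].
    """
    trips = []
    # groupby starts a new group whenever the taxi or the passenger flag changes
    for (taxi_id, passenger), run in groupby(rows, key=lambda r: (r[0], r[3])):
        if passenger != 1:
            continue  # runs of empty-taxi rows only mark the gap between trips
        run = list(run)
        trip = [taxi_id, run[0][4]]
        for _, c1, c2, _, _ in run:
            trip.extend([c1, c2])
        trips.append(trip)
    return trips
```

Each returned list can then be written out as one comma-separated line, matching the trajectory format asked for in the question.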

CodePudding user response:

A single file is 2 GB, and you still dare to call readlines directly? It will freeze on you!
===============================

Breaking this problem down, you should:
1. Design the database tables first, deciding how to partition them; for example, one table per date.
2. Read each text file line by line with readline and write each row to the corresponding database table.
3. Since all of your data is historical, you can create views at the database level to make future queries convenient.
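The three steps above might look roughly like this. The thread does not name a database, so this sketch assumes SQLite through Python's standard `sqlite3` module; the function `load_day`, the column names, and the `occupied_` view prefix are all hypothetical. Timestamps are converted to ISO text so that `ORDER BY ts` sorts chronologically.

```python
import sqlite3
from datetime import datetime

def load_day(conn, table, lines, batch_size=10000):
    """Stream one day's rows into `table`, then add an index and a view.

    `table` is interpolated into SQL, so it must be a trusted name.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(taxi_id INTEGER, c1 REAL, c2 REAL, passenger INTEGER, ts TEXT)"
    )
    insert = f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)"
    batch = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 5:
            continue  # skip malformed rows
        # Assumed input layout: 2014/8/3 21:18:46
        ts = datetime.strptime(parts[4], "%Y/%m/%d %H:%M:%S").isoformat(sep=" ")
        batch.append((int(parts[0]), float(parts[1]), float(parts[2]), int(parts[3]), ts))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
    # Index for "one taxi, in time order" queries; view for occupied rows only
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{table} ON {table} (taxi_id, ts)")
    conn.execute(
        f"CREATE VIEW IF NOT EXISTS occupied_{table} AS "
        f"SELECT * FROM {table} WHERE passenger = 1"
    )
    conn.commit()
```

Passing an open file object as `lines` keeps memory use flat, because the file is read one line at a time and inserted in batches.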

CodePudding user response:

First drop the unused rows that have no passenger; that will save a lot of time. Then read line by line, handling one car's data at a time, or try multithreading.
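A minimal sketch of that filtering step, assuming five comma-separated fields per row with the passenger flag in the fourth position; `occupied_rows` is a hypothetical name.

```python
def occupied_rows(lines):
    """Yield the split fields of each row whose passenger flag is 1."""
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) == 5 and parts[3].strip() == "1":
            yield parts
```

One caveat: the rows with flag 0 are what separate one trip from the next. If they are dropped before the rows are sorted and grouped, two consecutive trips by the same taxi can no longer be told apart by the flag alone, so it may be safer to filter after the trips have been split.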

CodePudding user response:

Quoting the 4th-floor reply from "day I":
Even simpler:
Put everything into the database, with one table per date.
That gives 25 tables.
Extract the data for one car on one day; the amount of data should be small.
Traverse that data. When the passenger flag is 1, keep taking the following rows for as long as the flag stays 1. When it drops to 0, write the extracted rows to a new TXT file, delete them from the original list, and restart the traversal, until the original list is empty.
When the traversal is finished, the TXT file contains several lines, each made up entirely of rows where the passenger flag is 1.

The many GPS records for one car on one day are not in time order, so I think combining your two answers should get me what I want. I'm mostly worried about speed; I'm afraid this will take twenty or thirty days to run.

CodePudding user response:

This is 25 GB of text data and the processing logic is not complicated, so there is no need for a database. Map every 1,000 cars to one intermediate text file; all of the code can finish running within 1 hour.
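One way to read this suggestion is sketched below: stream every source file once and append each row to the intermediate file for its taxi's bucket, so that each intermediate file is small enough to sort in memory afterwards. This is an illustration under assumptions, not the replier's code: it assumes the taxi number is an integer in the first field, and `split_into_buckets` and its parameters are hypothetical names.

```python
import os

def split_into_buckets(src_paths, dest_folder, cars_per_file=1000):
    """Write each row to dest_folder/<bucket>.txt, bucket = taxi_id // cars_per_file."""
    os.makedirs(dest_folder, exist_ok=True)
    handles = {}
    try:
        for path in src_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:  # streams the file; never loads it whole
                    taxi_id = line.split(",", 1)[0].strip()
                    if not taxi_id.isdigit():
                        continue  # skip blank or malformed rows
                    bucket = int(taxi_id) // cars_per_file
                    if bucket not in handles:
                        out_path = os.path.join(dest_folder, f"{bucket}.txt")
                        handles[bucket] = open(out_path, "w", encoding="utf-8")
                    handles[bucket].write(line if line.endswith("\n") else line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

With about 12,000 taxis and contiguous taxi numbers this would keep roughly a dozen output files open at once. Each intermediate file can then be loaded, sorted by taxi number and time, and turned into trips independently of the others.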

CodePudding user response:

Quoting the 8th-floor reply from "ice of wind":
This is 25 GB of text data and the processing logic is not complicated, so there is no need for a database. Map every 1,000 cars to one intermediate text file; all of the code can finish running within 1 hour.

Could you be more specific? I would be really grateful!

CodePudding user response:

Do not use the readlines function; it loads the whole file into memory at once. You can iterate over the file instead:
for line in open(file, encoding="utf-8"):
    # line is the data for one row
    print(line)

CodePudding user response:

Suppose the original files are in d:/trains and the converted files go to d:/middles; the following code splits the data into separate files.

import os, time
t1 = time.time()
car2file = {}
file_arr = []
one_file_car_num = 100
file_no = 1
car_no = 0
dest_folder = 'd:/middles2'
if not os.path.exists(dest_folder):
    os.mkdir(dest_folder)
file_arr.append(open(os.path.join(dest_folder, '%s.txt' % file_no), 'w'))

folder = 'd:/trains'
train_files = os.listdir(folder)
for file_name in train_files:
    ...  # the original post is cut off at this point