I have a table in SQLite with 649,876,358 rows. I need to find all rows that share a repeated value in one column and write every such group to its own file, but SQLite is working very slowly (after 10 minutes not a single group has been written). What can I do to make this faster?
I am using Python 3 with sqlite3:
import os
import json
import sqlite3 as sql

db = sql.connect(path_to_db)
with db:
    db_cursor = db.cursor()
    countLine = 649876358  # db_cursor.execute("SELECT COUNT(*) FROM 'T'").fetchall()[0][0]
    for i in range(1, countLine + 1):
        line = db_cursor.execute(f"SELECT * FROM 'T' WHERE ROWID = {i}").fetchall()[0]
        value = line[0]
        fileName = f"{path_to_save}{value}.json"
        if os.path.exists(fileName):
            continue
        allValue = db_cursor.execute(f"SELECT * FROM 'T' WHERE value = '{value}'").fetchall()
        if len(allValue) > 1:
            with open(fileName, "w") as file:
                json.dump(allValue, file)
CodePudding user response:
There are a few things that would help, but first of all I would add an index to your database:
create index your_index_name_here on your_table_name_here (your_column_name_here);
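For the schema in your question that would be something like the following (a sketch assuming, as in your code, that the table is named T and the duplicated column is named value):

db_cursor.execute("CREATE INDEX IF NOT EXISTS idx_t_value ON T (value)")
db.commit()

Building the index over ~650 million rows takes a while, but it only has to happen once; after that every WHERE value = ... lookup is an index search instead of a full table scan.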
I just created a database with three columns (id, name, number) like this, with 1 million rows (you have ~650 million):
1 R7N47UCYNE 363233
2 BFDXTROHVZ 273710
3 AVN6B8RCS7 949561
4 OHWTUCWDQW 863241
5 7F0B7UDJI4 791392
6 CV5TGBAQBJ 919640
7 PP8UF35C8J 897746
8 CN2J9ZAO63 671596
9 ZPM338I8QN 601521
10 12BFX9LXGE 845189
11 2NFP6DG299 858336
12 T5GH1V609M 692232
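A test table like this can be generated with something along these lines (a sketch; the table name users and the exact value ranges are assumptions):

import random
import string
import sqlite3

db = sqlite3.connect("sample.db")
db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, number INTEGER)")
# One million rows of random 10-character names and 6-digit numbers
rows = (
    ("".join(random.choices(string.ascii_uppercase + string.digits, k=10)),
     random.randint(100000, 999999))
    for _ in range(1_000_000)
)
db.executemany("INSERT INTO users (name, number) VALUES (?, ?)", rows)
db.commit()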
Searching for 10 names within my database before the index looked like this:
# (1999841, 'K6IWJK0MNP', 500224)
# (1715030, 'RY0N5VO9UM', 148000)
# (1999694, '474R4EMI3T', 705184)
# (1999539, 'L2YYGSLMUH', 943698)
# (1999425, 'UT352CDTQH', 828853)
# (1714958, 'KKAQ36DZGV', 279517)
# (1714905, 'BZ9CNWL5LS', 320070)
# (1714877, '6X7WCBBIZF', 247905)
# (1714839, '4MSV61HY9I', 263813)
# (1714806, '7NSRANTWEP', 476828)
# Finished in 1.3834 seconds
After the index was created:
# (1999841, 'K6IWJK0MNP', 500224)
# (1715030, 'RY0N5VO9UM', 148000)
# (1999694, '474R4EMI3T', 705184)
# (1999539, 'L2YYGSLMUH', 943698)
# (1999425, 'UT352CDTQH', 828853)
# (1714958, 'KKAQ36DZGV', 279517)
# (1714905, 'BZ9CNWL5LS', 320070)
# (1714877, '6X7WCBBIZF', 247905)
# (1714839, '4MSV61HY9I', 263813)
# (1714806, '7NSRANTWEP', 476828)
# Finished in 0.0006 seconds
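A timing comparison like this can be reproduced with a harness along these lines (a sketch; the table name users is an assumption, and the real script may differ):

import time
import sqlite3

db = sqlite3.connect("sample.db")
cursor = db.cursor()
# Pick 10 names at random, then time the equality searches
names = [row[0] for row in cursor.execute("SELECT name FROM users ORDER BY RANDOM() LIMIT 10")]
start = time.perf_counter()
for name in names:
    print(cursor.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()[0])
print(f"# Finished in {time.perf_counter() - start:.4f} seconds")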
You can also set up a second cursor to do the per-value search rather than doing a count and stepping through ROWIDs one at a time. Your first query could be:
select_all = "SELECT * FROM your_table"
rows = cursor.execute(select_all).fetchall()
for row in rows:
    value = row[0]
    # ...etc
Then use each result to generate a second query:
allValue = cursor2.execute(f"SELECT * FROM 'T' WHERE value = '{value}'").fetchall()
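Putting that together, a fuller sketch might look like this (the parameter binding and the streamed first cursor are my changes; path_to_db, path_to_save, and the table/column names come from the question):

import os
import json
import sqlite3

db = sqlite3.connect(path_to_db)
read_cursor = db.cursor()   # streams every row of the table
group_cursor = db.cursor()  # fetches the full group for one value

# Iterating the cursor streams rows; fetchall() on ~650 million rows
# would not fit in memory.
for (value,) in read_cursor.execute("SELECT value FROM T"):
    file_name = f"{path_to_save}{value}.json"
    if os.path.exists(file_name):
        continue  # this group has already been written
    # "?" placeholders let sqlite3 bind the value safely
    group = group_cursor.execute("SELECT * FROM T WHERE value = ?", (value,)).fetchall()
    if len(group) > 1:
        with open(file_name, "w") as file:
            json.dump(group, file)

With the index from the first step in place, each per-value lookup is an index search rather than a full scan, so the inner query stops being the bottleneck.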
Finally, SQLite might not be the best database to work with for this volume of rows. You could move to a more performant database by dumping the SQLite one to something like MariaDB:
sqlite3 sample.db .dump > dump.sql
Then load it into your new database with a command like:
mysql -p -u root -h 127.0.0.1 newdb < dump.sql
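Note that the target database has to exist first (for example mysql -p -u root -h 127.0.0.1 -e "CREATE DATABASE newdb"), and the SQLite and MySQL dialects differ slightly, so lines such as the leading PRAGMA statements may need to be stripped from the dump before it loads cleanly.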