Data cleanup in Oracle DB is taking long time for 300 billion records

Time:09-16

Problem statement: There is an address table in Oracle that has relationships with multiple tables such as subscriber, member, etc. The current design increments a record version across all tables whenever any associated table changes, so a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.

Tried solution:

  • We wrote a cleanup script that generates a unique hash for every address. If the calculated hash is already present, the address is a duplicate; we merge it into a single address record and update the foreign keys in the associated tables.
  • The problem is that the address table holds around 300 billion records, so this cleanup process is very slow and will take several days to complete.
  • We tried adding an index on the hash column, but the process is still slow.
  • We have also updated the insert/query logic to use addresses in the new structure (using the hash, and without the version column), so that incoming production requests are handled correctly.
  • We are planning to process in chunks, but that will be a very long, ongoing activity.
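The merge step described above can be done set-based rather than row by row. A minimal sketch, assuming hypothetical names (`address(address_id, addr_hash)` and a child table `subscriber(address_id)`; adapt to the real schema):

```sql
-- 1. Pick one surviving row per hash (here: the smallest id).
CREATE TABLE address_keep AS
SELECT addr_hash, MIN(address_id) AS keep_id
FROM address
GROUP BY addr_hash;

-- 2. Repoint foreign keys in each child table to the survivor.
UPDATE subscriber s
SET s.address_id = (
  SELECT k.keep_id
  FROM address a
  JOIN address_keep k ON k.addr_hash = a.addr_hash
  WHERE a.address_id = s.address_id
);

-- 3. Remove every address row that is not a survivor.
DELETE FROM address a
WHERE a.address_id NOT IN (SELECT keep_id FROM address_keep);
```

Three set-based statements like this let Oracle use hash joins and full scans instead of 300 billion individual lookups, which is usually far faster than a per-row script.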

Questions:

  1. Can any further improvement be made to the above approach?
  2. Would distributed processing help here (e.g. Hadoop, Spark, Hive, MapReduce)?
  3. Is there some tool that can be used here?

CodePudding user response:

You could use Oracle's MERGE statement if you write clean SQL.
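For example, repointing the foreign keys could be a single MERGE instead of per-row updates. A sketch with hypothetical names (`subscriber`, `address`, `addr_hash`); note that Oracle does not allow updating a column referenced in the ON clause (ORA-38104), so the join is done on rowid:

```sql
MERGE INTO subscriber s
USING (
  -- For each child row pointing at a duplicate, find the surviving id.
  SELECT s2.rowid AS rid, k.keep_id
  FROM subscriber s2
  JOIN address a ON a.address_id = s2.address_id
  JOIN (SELECT addr_hash, MIN(address_id) AS keep_id
        FROM address GROUP BY addr_hash) k
    ON k.addr_hash = a.addr_hash
  WHERE s2.address_id <> k.keep_id
) m
ON (s.rowid = m.rid)  -- rowid join avoids the ORA-38104 restriction
WHEN MATCHED THEN UPDATE SET s.address_id = m.keep_id;
```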

CodePudding user response:

Suggestion 1

Use Oracle's built-in parallel DML (note the `+` in the hint; without it, the comment is ignored, and parallel DML must be enabled for the session):

alter session enable parallel dml;
delete /*+ parallel(t 8) */ from mytable t where ...;

Suggestion 2

Use distributed processing (Hadoop, Spark, Hive), but watch out for potential contention on indexes or table blocks. It is recommended that each process work on a logically isolated subset, e.g.

process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
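You do not necessarily need Hadoop for this; the same range-chunking can be done in PL/SQL with a commit per chunk so undo stays bounded. A sketch assuming the hypothetical `address(address_id, addr_hash)` table; the chunk size is illustrative:

```sql
DECLARE
  l_lo    NUMBER := 0;
  l_hi    NUMBER;
  c_chunk CONSTANT NUMBER := 1000000;  -- rows of id range per pass
BEGIN
  SELECT MAX(address_id) INTO l_hi FROM address;
  WHILE l_lo <= l_hi LOOP
    -- Delete duplicates (non-survivors) within this id range only.
    DELETE FROM address a
    WHERE a.address_id BETWEEN l_lo AND l_lo + c_chunk - 1
      AND a.address_id NOT IN (
        SELECT MIN(address_id) FROM address GROUP BY addr_hash
      );
    COMMIT;  -- release undo/locks between chunks
    l_lo := l_lo + c_chunk;
  END LOOP;
END;
/
```

Oracle also ships a built-in package for exactly this pattern, DBMS_PARALLEL_EXECUTE, which splits a table into rowid or number ranges and runs a DML statement per chunk in parallel jobs with per-chunk commits and restartability.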

Suggestion 3

If more than ~30% of the table needs to be deleted, the fastest way is to create an empty table, copy all required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and the duration depends heavily on the number of indexes: the more you have, the longer it will take.
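A sketch of this copy-and-swap approach, again with hypothetical names; NOLOGGING and PARALLEL cut redo generation and elapsed time for the copy:

```sql
-- Copy only one surviving row per hash into a fresh table.
CREATE TABLE address_new PARALLEL 8 NOLOGGING AS
SELECT *
FROM address a
WHERE a.address_id IN (
  SELECT MIN(address_id) FROM address GROUP BY addr_hash
);

DROP TABLE address;
RENAME address_new TO address;

-- Recreate indexes and constraints, e.g.:
CREATE INDEX address_hash_ix ON address (addr_hash) PARALLEL 8;
```

If the table was created NOLOGGING, take a backup afterwards, since the copied rows are not in the redo stream.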

P.S. There are no "magic" tools for this. In the end they all run the same SQL commands that you can run yourself.
