I use PDI (Kettle) to extract data from MongoDB into Greenplum. When I tested extracting from MongoDB to a file, it was fast: about 10,000 rows per second. But when extracting into Greenplum, it is only about 130 rows per second. I also modified the following Greenplum parameters, but there was no significant improvement:
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of copies of the output table step, the transformation seems to hang and no data is inserted for a long time. I don't know why.
How can I increase the performance of inserting data from MongoDB into Greenplum with PDI (Kettle)? Thank you.
CodePudding user response:
There are a variety of factors that could be at play here.
- Is PDI loading via an ODBC or JDBC connection?
- What is the size of the data? (Row count doesn't really tell us much.)
- What is the size of your Greenplum cluster (number of hosts and number of segments per host)?
- Is the table you are loading into indexed?
- What is the network connectivity between Mongo and Greenplum?
The best bulk-load performance with data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc. is achieved by using Greenplum's native bulk-loading utilities, gpfdist and gpload (see the sketch below).
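As a rough sketch (the host name, port, file path, and table/column names are placeholders, not taken from the question), the gpfdist route is: start a gpfdist process serving a directory of exported files, map it as a readable external table, and insert from that table so the data flows in parallel to the segments instead of row by row through the master.

# serve the export directory over HTTP for the segments (placeholder path/port)
gpfdist -d /data/staging -p 8081 &

-- map the served file as an external table (placeholder columns)
CREATE READABLE EXTERNAL TABLE ext_mongo_staging (
    id  text,
    doc text
)
LOCATION ('gpfdist://etl-host:8081/mongo_export.csv')
FORMAT 'CSV' (DELIMITER ',');

-- bulk insert into the real target table
INSERT INTO target_table SELECT * FROM ext_mongo_staging;

gpload wraps this same mechanism in a YAML control file, so either path gives you parallel loading rather than single-row inserts.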
CodePudding user response:
Greenplum loves batches.
a) You can modify the batch size in the transformation settings via "Nr of rows in rowset".
b) You can modify the commit size in the Table Output step.
I think a and b should match.
Find your optimum values. (For example, we use 1000 for rows with big JSON objects inside.)
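For orientation, here is a minimal sketch of where these two values end up in a transformation's .ktr XML. The element names are my assumption from memory and may vary by PDI version, so set them through Spoon (Transformation settings > Miscellaneous, and the Table Output step dialog) rather than editing the file by hand:

<transformation>
  <info>
    <!-- "Nr of rows in rowset": batch size passed between steps (assumed tag name) -->
    <size_rowset>1000</size_rowset>
  </info>
  <step>
    <name>Table Output</name>
    <type>TableOutput</type>
    <!-- commit size: rows per transaction on the Greenplum side (assumed tag name) -->
    <commit>1000</commit>
  </step>
</transformation>

Matching the two, as suggested above, means each rowset handed to Table Output is committed as one batch, which is the pattern Greenplum handles well.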