Pymongo vs psycopg3: orders of magnitude difference for writes?

I have an application that collects crypto prices from decentralized exchanges (nothing commercial, the goal is mainly to learn some database technologies with real data). I got it working with both MongoDB and PostgreSQL. However, I'm seeing massive differences when it comes to writes, in favour of MongoDB. I'm talking several orders of magnitude. I know MongoDB is a NoSQL database and is advertised as being more efficient for this kind of workload, but I was wondering if I'm missing something in my Postgres implementation. I'll describe below how I implemented the logic, and I'll try to provide comparable metrics as much as I can.

To talk to the databases, I use psycopg3 for the Postgres implementation, and pymongo for the MongoDB one.

Here is the data structure I'm writing to the databases:

from __future__ import annotations

from typing import NamedTuple

import pendulum


class PricingInfo(NamedTuple):
    blockchain_name: str
    dex_name: str
    pair_address: str
    token0_symbol: str
    token1_symbol: str
    token0_address: str
    token1_address: str
    raw_reserve0: int
    raw_reserve1: int
    reserve0: float
    reserve1: float
    mid_price: float
    datetime: pendulum.DateTime
    inverted: bool = False

    @classmethod
    def from_dict(cls, doc: dict) -> PricingInfo:
        doc.pop("_id", None)
        return cls(**doc)

    def to_dict(self) -> dict:
        return self._asdict()

The class is the same for both implementations, and I have hundreds of these records to write to the database every second. Here is how I do it for Postgres:

def register_prices(self, prices: list[PricingInfo]) -> None:
    query = """
    insert into prices (
        blockchain_name,
        dex_name,
        pair_address,
        token0_address,
        token1_address,
        raw_reserve0,
        raw_reserve1,
        reserve0,
        reserve1,
        mid_price,
        datetime
    )
    values (
        %(blockchain_name)s,
        %(dex_name)s,
        %(pair_address)s,
        %(token0_address)s,
        %(token1_address)s,
        %(raw_reserve0)s,
        %(raw_reserve1)s,
        %(reserve0)s,
        %(reserve1)s,
        %(mid_price)s,
        %(datetime)s
    )
    """

    keys_to_keep = {
        "blockchain_name",
        "dex_name",
        "pair_address",
        "token0_address",
        "token1_address",
        "raw_reserve0",
        "raw_reserve1",
        "reserve0",
        "reserve1",
        "mid_price",
        "datetime",
    }

    with psycopg.connect(self.db_uri) as conn:
        with conn.cursor() as cur:
            start = time.perf_counter()
            if len(prices) == 1:
                cur.execute(
                    query,
                    {
                        k: v
                        for k, v in prices[0].to_dict().items()
                        if k in keys_to_keep
                    },
                )
            elif len(prices) > 1:
                cur.executemany(
                    query,
                    [
                        {k: v for k, v in p.to_dict().items() if k in keys_to_keep}
                        for p in prices
                    ],
                )

        conn.commit()
    delta = time.perf_counter() - start

    if self.dex_name in {"pangolin", "trader_joe"}:
        logger.warning(f"Inserting {len(prices)}")
        logger.warning(f"Inserting prices took {delta} seconds")

And here is my table definition:

create table prices (
    id serial primary key,
    blockchain_name varchar(100) not null,
    dex_name varchar(100) not null,
    raw_reserve0 decimal not null,
    raw_reserve1 decimal not null,
    reserve0 decimal not null,
    reserve1 decimal not null,
    mid_price decimal not null,
    datetime timestamp with time zone not null,
    pair_address varchar(50) not null,
    token0_address varchar(50) not null,
    token1_address varchar(50) not null,
    foreign key (blockchain_name, dex_name, pair_address) references pairs (blockchain_name, dex_name, pair_address),
    foreign key (blockchain_name, dex_name, token0_address) references tokens (blockchain_name, dex_name, address),
    foreign key (blockchain_name, dex_name, token1_address) references tokens (blockchain_name, dex_name, address)
);

For MongoDB:

def register_prices(self, prices: list[PricingInfo]) -> None:

    start = time.perf_counter()

    prices_table = self._db["prices"]
    prices_table.insert_many(price.to_dict() for price in prices)

    delta = time.perf_counter() - start

    if self.dex_name in {"pangolin", "trader_joe"}:
        logger.warning(f"Inserting {len(prices)}")
        logger.warning(f"Inserting prices took {delta} seconds")

The app is run in exactly the same way for both databases. There is a small difference on the Postgres side, where the dict of data needs to be massaged a bit to fit the schema (I'm doing some normalization there), but since I only have ~600 dicts to modify on each write, I don't think that is the bottleneck. In both cases, 8 processes write concurrently to the database.

For postgres, I'm getting these metrics:

Inserting 587
Inserting prices took 1.175270811014343 seconds
Inserting 611
Inserting prices took 0.3126116280036513 seconds

For mongo:

Inserting 588
Inserting prices took 0.03671051503624767 seconds
Inserting 612
Inserting prices took 0.032324473024345934 seconds

These timings are relatively stable: roughly 1 s and 300 ms for Postgres, and around 30 ms for Mongo. It's curious that Postgres shows two distinct write timings for roughly the same amount of data. Nonetheless, even in its best case, Postgres is still 10x slower than Mongo.

Additional notes:

  • for Postgres, I have some foreign key constraints; I tried removing them, but it didn't change the timings significantly
  • I tried alter user postgres set synchronous_commit to off; for Postgres, with no visible impact on the timings
  • I run postgres:14 and mongo:4, through Docker

Am I doing something wrong with Postgres?

CodePudding user response:

For batch import of data into Postgres, the fastest way is generally the Postgres COPY command. In psycopg3 this is exposed through the procedures shown in the psycopg3 COPY documentation. The caveat is that COPY is all-or-nothing: either all the data is imported, or an error means none of it is.
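A minimal sketch of what that could look like for the register_prices method from the question (untested; it assumes the same psycopg import, self.db_uri, and the prices columns shown above, and uses psycopg3's cur.copy() / write_row() helpers):

def register_prices(self, prices: list[PricingInfo]) -> None:
    # Stream the rows to the server with COPY instead of issuing
    # one INSERT per row through executemany().
    columns = (
        "blockchain_name",
        "dex_name",
        "pair_address",
        "token0_address",
        "token1_address",
        "raw_reserve0",
        "raw_reserve1",
        "reserve0",
        "reserve1",
        "mid_price",
        "datetime",
    )
    copy_sql = f"COPY prices ({', '.join(columns)}) FROM STDIN"

    with psycopg.connect(self.db_uri) as conn:
        with conn.cursor() as cur:
            with cur.copy(copy_sql) as copy:
                for price in prices:
                    row = price.to_dict()
                    # write_row() adapts Python values (str, int, float,
                    # datetime, ...) to the COPY format for us.
                    copy.write_row(tuple(row[c] for c in columns))
        # The connection context manager commits on a clean exit, so the
        # whole batch either lands or rolls back together.

Because everything happens in a single COPY inside one transaction, one bad row aborts the entire batch, which is the all-or-nothing caveat mentioned above.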
