Fetching data from postgres database on jupyter notebook


I have this script to fetch data from a Postgres database:

import pandas as pd
from sqlalchemy import create_engine

POSTGRES_ADDRESS = 'xxxx'  ## CHANGE THIS TO YOUR PANOPLY/POSTGRES HOST
POSTGRES_PORT = 'xxxx'
POSTGRES_USERNAME = 'xxx'  ## CHANGE THIS TO YOUR PANOPLY/POSTGRES USERNAME
POSTGRES_PASSWORD = 'xxx'  ## CHANGE THIS TO YOUR PANOPLY/POSTGRES PASSWORD
POSTGRES_DBNAME = 'xxx'    ## CHANGE THIS TO YOUR DATABASE NAME

postgres_str = (f'postgresql://{POSTGRES_USERNAME}:{POSTGRES_PASSWORD}'
                f'@{POSTGRES_ADDRESS}:{POSTGRES_PORT}/{POSTGRES_DBNAME}')
# Create the connection
cnx = create_engine(postgres_str)

When I use a LIMIT clause, I'm able to fetch the data:

table_name = pd.read_sql_query("""SELECT * FROM public.timeline limit 1000""", cnx)
table_name

When I try to fetch without the limit, I get this error:

"Connection failed A connection to the notebook server could not be established. The notebook will continue trying to reconnect. Check your network connection or notebook server configuration."

In this case, would you recommend I use PySpark, since it looks like the data qualifies as big data?

I ran a COUNT and got 66,231,781 rows.
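For reference, the count query looked roughly like this, reusing the cnx engine from above:

# Count the rows in the table using the existing engine
pd.read_sql_query("SELECT COUNT(*) FROM public.timeline", cnx)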

CodePudding user response:

By default, the PostgreSQL database driver uses a client-side cursor, which loads the entire result set into memory at once. With 66 million rows, that can exhaust the notebook server's memory and crash the kernel, which is what the connection error indicates. Instead, you can use a server-side cursor and stream the data to the client in batches. The following code iterates through the query result in batches of 1,000 rows, as set by the chunksize parameter. You can adjust the value of chunksize to meet your needs.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(f"postgresql://{POSTGRES_USERNAME}:{POSTGRES_PASSWORD}@{POSTGRES_ADDRESS}:{POSTGRES_PORT}/{POSTGRES_DBNAME}")

# stream_results=True makes SQLAlchemy use a server-side cursor,
# so rows are fetched from the server incrementally rather than all at once
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk_df in pd.read_sql("SELECT * FROM public.timeline", conn, chunksize=1000):
        print(f"Dataframe has {len(chunk_df)} rows.")