I want to create a system that loads large amounts of data into pandas for analysis, and later writes it back to .parquet files.
When I test this with a simple example, it looks as if there is some built-in limit on the number of rows:
import pandas as pd
# Create file with 100 000 000 rows
contents = """
Tommy;19
Karen;20
"""*50000000
open("person.csv","w").write(
f"""
Name;Age
{contents}
"""
)
print("Test generated")
df = pd.read_csv("person.csv",delimiter=";")
len(df)
This returns 10 000 000, not 100 000 000.
CodePudding user response:
Change the way you create the file: I think you have too many blank rows, and you don't close the file properly (no context manager or explicit close()), so buffered data may never be flushed to disk:
# Create file with 100 000 000 rows; the backslash after the opening
# quotes prevents a leading blank line in every repetition
contents = """\
Tommy;19
Karen;20
"""*50000000
with open('person.csv', 'w') as fp:  # the context manager closes (and flushes) the file
    fp.write('Name;Age\n')
    fp.write(contents)
Read the file:
df = pd.read_csv('person.csv', delimiter=';')
print(df)
# Output
Name Age
0 Tommy 19
1 Karen 20
2 Tommy 19
3 Karen 20
4 Tommy 19
... ... ...
99999995 Karen 20
99999996 Tommy 19
99999997 Karen 20
99999998 Tommy 19
99999999 Karen 20
[100000000 rows x 2 columns]
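Since the question also mentions writing the data back to .parquet, here is a minimal sketch of that step (the file name is illustrative, and a parquet engine such as pyarrow or fastparquet must be installed):
# Hypothetical write-back; pandas delegates to pyarrow/fastparquet here
df.to_parquet('person.parquet', index=False)

# Reading it back is symmetrical
df = pd.read_parquet('person.parquet')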
CodePudding user response:
I don't think there is a hard limit, but there is a limit to how much it can process at a time; you can work around that by making your code more efficient.
I'm currently working with around 1-2 million rows without any issues.
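For example, a common way to work around memory pressure is to process the CSV in chunks instead of loading everything at once; a rough sketch (the chunk size and the per-chunk work are illustrative):
import pandas as pd

total_rows = 0
# chunksize makes read_csv yield DataFrames of at most 1,000,000 rows each
for chunk in pd.read_csv('person.csv', delimiter=';', chunksize=1_000_000):
    total_rows += len(chunk)  # replace with whatever analysis you need per chunk

print(total_rows)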
CodePudding user response:
The main bottleneck is your memory: pandas uses NumPy under the hood, so you can load that many rows as long as they fit in your computer's RAM.
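A rough sketch of checking the in-memory footprint and shrinking it with narrower dtypes (the dtype choices are assumptions based on the sample data):
import pandas as pd

# category for the heavily repeated names, int8 for the small ages
df = pd.read_csv('person.csv', delimiter=';',
                 dtype={'Name': 'category', 'Age': 'int8'})

# total bytes used by the DataFrame in memory
print(df.memory_usage(deep=True).sum())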