I want to create a system that loads large amounts of data into pandas for analysis, and later writes it back to .parquet files.
When I test this with a simple example, it looks as if there is some built-in limit on the number of rows:
import pandas as pd
# Create file with 100 000 000 rows
contents = """
Tommy;19
Karen;20
"""*50000000
open("person.csv","w").write(
f"""
Name;Age
{contents}
"""
)
print("Test generated")
df = pd.read_csv("person.csv",delimiter=";")
len(df)
This returns 10 000 000, not 100 000 000.
CodePudding user response:
Change the way you create the file: I think you have too many blank rows, and you don't close the file properly (no context manager or explicit close()), so buffered data may never be flushed to disk:
# Create file with 100 000 000 rows; the backslash after the opening
# quotes prevents a leading blank line in every repetition
contents = """\
Tommy;19
Karen;20
"""*50000000
with open('person.csv', 'w') as fp:  # the context manager closes (and flushes) the file
    fp.write('Name;Age\n')
    fp.write(contents)
Read the file:
df = pd.read_csv('person.csv', delimiter=';')
print(df)
# Output
Name Age
0 Tommy 19
1 Karen 20
2 Tommy 19
3 Karen 20
4 Tommy 19
... ... ...
99999995 Karen 20
99999996 Tommy 19
99999997 Karen 20
99999998 Tommy 19
99999999 Karen 20
[100000000 rows x 2 columns]
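Since the question also mentions writing the data back to .parquet, here is a minimal sketch of that step (the file name is illustrative, and a parquet engine such as pyarrow or fastparquet must be installed):
# Hypothetical write-back; pandas delegates to pyarrow/fastparquet here
df.to_parquet('person.parquet', index=False)

# Reading it back is symmetrical
df = pd.read_parquet('person.parquet')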
CodePudding user response:
I don't think there is a hard limit, but there is a limit to how much it can process at a time; you can work around that by making your code more efficient.
I'm currently working with around 1-2 million rows without any issues.
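For example, a common way to work around memory pressure is to process the CSV in chunks instead of loading everything at once; a rough sketch (the chunk size and the per-chunk work are illustrative):
import pandas as pd

total_rows = 0
# chunksize makes read_csv yield DataFrames of at most 1,000,000 rows each
for chunk in pd.read_csv('person.csv', delimiter=';', chunksize=1_000_000):
    total_rows += len(chunk)  # replace with whatever analysis you need per chunk

print(total_rows)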
CodePudding user response:
The main bottleneck is your memory: pandas uses NumPy under the hood, so you can load that many rows as long as they fit in your computer's RAM.
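A rough sketch of checking the in-memory footprint and shrinking it with narrower dtypes (the dtype choices are assumptions based on the sample data):
import pandas as pd

# category for the heavily repeated names, int8 for the small ages
df = pd.read_csv('person.csv', delimiter=';',
                 dtype={'Name': 'category', 'Age': 'int8'})

# total bytes used by the DataFrame in memory
print(df.memory_usage(deep=True).sum())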