I have a very large matrix (100K x 100K). To keep memory use down, I am storing it in a DataFrame with int8 as the dtype (1 byte per cell). However, the columns end up as float64, at 8 bytes per cell. Where am I going wrong?
import pandas as pd

df = pd.DataFrame()
df = df.astype('int8')
mat_len = 100  # 100_000 for the real matrix; 100 matches the output below
for i in range(0, mat_len):
    new_row = pd.Series([0] * mat_len)
    df = df.append(new_row, ignore_index=True)  # initializing matrix
for i in range(0, mat_len):
    for j in range(i, mat_len):
        df.iloc[i, j] = i + j  # simplified calc for testing purposes
print(df.info())
output: dtypes: float64(100)
CodePudding user response:
Don't make things complicated, use numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 100), dtype='int8'))
>>> df.dtypes
0 int8
1 int8
2 int8
3 int8
4 int8
...
95 int8
96 int8
97 int8
98 int8
99 int8
Length: 100, dtype: object
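For a sense of scale at the full 100K x 100K size, the arithmetic can be sketched as follows. The small `n` is an assumption chosen so the check runs quickly; the full-size figures are computed from the element count rather than by allocating anything.

```python
import numpy as np
import pandas as pd

n = 1_000  # small stand-in size (assumption), cheap to allocate
df = pd.DataFrame(np.zeros((n, n), dtype="int8"))

# bytes used by the data itself (index excluded): one byte per cell for int8
small_bytes = int(df.memory_usage(index=False).sum())

# scale the per-cell cost up to the full matrix without allocating it
full_cells = 100_000 * 100_000
int8_gib = full_cells * np.dtype("int8").itemsize / 2**30
float64_gib = full_cells * np.dtype("float64").itemsize / 2**30

print(small_bytes)
print(f"int8: {int8_gib:.1f} GiB, float64: {float64_gib:.1f} GiB")
```

So even at 1 byte per cell the full matrix needs roughly 9.3 GiB, against roughly 74.5 GiB as float64.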
CodePudding user response:
You need to move the conversion inside your loop:
import pandas as pd

df = pd.DataFrame()
mat_len = 100
for i in range(0, mat_len):
    new_row = pd.Series([0] * mat_len)
    df = df.append(new_row, ignore_index=True)
    df = df.astype('int8')  # initializing matrix
for i in range(0, mat_len):
    for j in range(i, mat_len):
        df.iloc[i, j] = i + j  # simplified calc for testing purposes
print(df.info())
which returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 100 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 100 non-null int8
1 1 100 non-null int8
2 2 100 non-null int8
3 3 100 non-null int8
4 4 100 non-null int8
5 5 100 non-null int8
6 6 100 non-null int8
7 7 100 non-null int8
8 8 100 non-null int8
9 9 100 non-null int8
10 10 100 non-null int8
11 11 100 non-null int8
12 12 100 non-null int8
13 13 100 non-null int8
14 14 100 non-null int8
15 15 100 non-null int8
16 16 100 non-null int8
17 17 100 non-null int8
18 18 100 non-null int8
19 19 100 non-null int8
20 20 100 non-null int8
21 21 100 non-null int8
22 22 100 non-null int8
23 23 100 non-null int8
24 24 100 non-null int8
25 25 100 non-null int8
26 26 100 non-null int8
27 27 100 non-null int8
28 28 100 non-null int8
29 29 100 non-null int8
30 30 100 non-null int8
31 31 100 non-null int8
32 32 100 non-null int8
33 33 100 non-null int8
34 34 100 non-null int8
35 35 100 non-null int8
36 36 100 non-null int8
37 37 100 non-null int8
38 38 100 non-null int8
39 39 100 non-null int8
40 40 100 non-null int8
41 41 100 non-null int8
42 42 100 non-null int8
43 43 100 non-null int8
44 44 100 non-null int8
45 45 100 non-null int8
46 46 100 non-null int8
47 47 100 non-null int8
48 48 100 non-null int8
49 49 100 non-null int8
50 50 100 non-null int8
51 51 100 non-null int8
52 52 100 non-null int8
53 53 100 non-null int8
54 54 100 non-null int8
55 55 100 non-null int8
56 56 100 non-null int8
57 57 100 non-null int8
58 58 100 non-null int8
59 59 100 non-null int8
60 60 100 non-null int8
61 61 100 non-null int8
62 62 100 non-null int8
63 63 100 non-null int8
64 64 100 non-null int8
65 65 100 non-null int8
66 66 100 non-null int8
67 67 100 non-null int8
68 68 100 non-null int8
69 69 100 non-null int8
70 70 100 non-null int8
71 71 100 non-null int8
72 72 100 non-null int8
73 73 100 non-null int8
74 74 100 non-null int8
75 75 100 non-null int8
76 76 100 non-null int8
77 77 100 non-null int8
78 78 100 non-null int8
79 79 100 non-null int8
80 80 100 non-null int8
81 81 100 non-null int8
82 82 100 non-null int8
83 83 100 non-null int8
84 84 100 non-null int8
85 85 100 non-null int8
86 86 100 non-null int8
87 87 100 non-null int8
88 88 100 non-null int8
89 89 100 non-null int8
90 90 100 non-null int8
91 91 100 non-null int8
92 92 100 non-null int8
93 93 100 non-null int8
94 94 100 non-null int8
95 95 100 non-null int8
96 96 100 non-null int8
97 97 100 non-null int8
98 98 100 non-null int8
99 99 100 non-null int8
dtypes: int8(100)
memory usage: 9.9 KB
None
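Why the original ordering fails can be seen in a small sketch: `astype` on an empty DataFrame has no columns to convert, so it leaves no dtype behind for rows added later. The sketch below builds the rows as a plain list first (an assumption, used here instead of `append` so that it also runs on current pandas) and converts once the data exists.

```python
import numpy as np
import pandas as pd

mat_len = 100

# astype on an empty frame: there are no columns, so nothing is converted
empty = pd.DataFrame().astype("int8")
print(len(empty.dtypes))

# build the data first, then convert the populated frame in one step
rows = [[0] * mat_len for _ in range(mat_len)]
df = pd.DataFrame(rows).astype("int8")
print(df.dtypes.unique())
```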
CodePudding user response:
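You should always avoid using append on DataFrames/Series, especially in a loop: each call returns a new object and copies all the existing data, so it is very slow. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.) Generate the data first, then create a DataFrame from it.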
df.iloc[i,j] = i + j #simplified calc for testing purposes
How complex is your calculation? In this simple case, your code can be greatly simplified and sped up by using numpy.fromfunction and numpy.triu:
import numpy as np
import pandas as pd

mat_len = 100_000
# create the matrix from a function of the indices
# (note: int8 only holds values from -128 to 127)
mat = np.fromfunction(lambda i, j: i + j, shape=(mat_len, mat_len), dtype='int8')
# make it an upper triangular matrix
mat = np.triu(mat)
# create a DataFrame with it
df = pd.DataFrame(mat)
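As a sanity check, the vectorized version can be compared with the original nested loops on a small size. The size 50 below is an assumption chosen so that every `i + j` stays within the int8 range (-128 to 127); larger results would not fit in int8 and would need a wider dtype.

```python
import numpy as np
import pandas as pd

n = 50  # small test size (assumption): max i + j is 98, which fits in int8

# vectorized: build from the indices, then keep the upper triangle
mat = np.fromfunction(lambda i, j: i + j, shape=(n, n), dtype="int8")
mat = np.triu(mat)

# reference: the nested loops from the question, filling j >= i only
expected = np.zeros((n, n), dtype="int8")
for i in range(n):
    for j in range(i, n):
        expected[i, j] = i + j

print(np.array_equal(mat, expected))

df = pd.DataFrame(mat)
print(df.dtypes.unique())
```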