I am new to python, pandas and mariadb and I need your help and advice, please. I have a csv file that contains data with operations performed on every day of a month. I need to extract all the values of a specific day and insert them into the database. I thought of storing all the data in a DataFrame than I would then iterate and store the data of that day in another DataFrame, but I get the following warning:
The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Here is my code so far:
import datetime as dt
import pandas as pd
data = pd.read_csv('./files/salesreport.csv',
names=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6',
'col_7', 'col_8', 'col_9', 'col_10'],
sep=',',
header=None,
keep_default_na=False,
na_values="")
df = pd.DataFrame(data)
pd.set_option('display.max_columns', None)
now = dt.datetime.now()
# today_date = now.strftime("%m-%d-%Y")
today_date = '07-17-2022' //i'm using this date for testing
tmp_df = pd.DataFrame()
for (index, row) in df.iterrows():
if row.col_1 == today_date:
tmp_df = tmp_df.append(row, ignore_index=True)
I think my approach could also raise performance problems if the file is too big. Can you please help me to find the best solution to be able to make the insert in the database only of the records specific to a certain day? Also, can you tell me what is the best approach to make an insert in the database with MariaDB?
Thanks!
CodePudding user response:
This sounds like a simple filter to me:
tmp_df = df[df.col_1 == today_date]
(and you can follow this with a .reset_index()
, if that's important)
Indeed, repeat append is slow and a poor model for working with pandas.
CodePudding user response:
It could be done like this
import pandas as pd
from sqlalchemy import create_engine
engine=create_engine('mysql pymysql://root:root@localhost/test')
df= pd.read_csv('../files/salesreport.csv', names=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6',
'col_7', 'col_8', 'col_9', 'col_10'], sep=',', header=None)
#filter by date
df = df[df['col_1'].str.contains('07-17-2022')]
#write data to mysql
df.to_sql('salesreport', engine, schema='test', if_exists='replace', index=True, index_label='id')
Or converting to datetime
#convert to datetime
df['col_1']=pd.to_datetime(df['col_1'], format='%m-%d-%Y')
#filter by date
df=df[df['col_1']=='2022-07-17']