What are the best ways to save the data for my data science project?
**EDIT: The main question I am facing: can I store the data in a single table?
I presume not, since the table structure differs for each student. I also need to count the number of exams each student has taken and the results they got, and the same for matches.**
Data source 1 (currently stored as a list of strings): I am collecting this data from a large CSV file of student details.
Student X
Age, 35
Home, Argentina
Height, 134
Student Y
Age, 34
Home, Brazil
Height, 134
Student Z
Age, 24
Home, India
Height, 134
Study stats
Student X
Exam,Marks,Comment,
English, 120, Good writing,
Math, 159, Excelent,
Geology, 105, need better,
Student Z
Exam,Marks,Comment,
Physics, 10, bad writing,
Math, 159, Excelent,
exercise, 145, need better,
Game stats
Batting data
Student X
Match no,Run,Dismisal,
Match 13, 120, Bowled,
Match 22, 19, stamping,
Match 31, 15, Bowled,
Student Y
Match no,Run,Dismisal,
Match 56, 60, stamping,
Match 65, 19, stamping,
Match 29, 15, Bowled,
Bowling data
Student X
Match no,Ball, Run rate, Type of Bowling, Description,
Match 43, 120, 7.5,Fast, Boweld fast; Pace High; Yorker;
Match 42, 19, 48.5,Spin, Bowled off break; Yorker;
Match 41, 15, 38.5,Fast, Yorker; Bowled slow;
Student Z
Bowling
Match no,Ball, Run rate, Type of Bowling, Description,
Match 51, 60, 9.4, Fast, Boweld fast; Pace High; Yorker;
Match 40, 48, 92.2, Fast, Yorker; Bowled slow;
The descriptions are long text (about 200 words each); I will need them later.
PDF files and picture files (I have not done anything with these files yet)
Student X
Paper 1, file 1242
Picture 45, file 14950
Student Y
Paper 145, file 14893
Picture 45049, file 2048
Now the problem: I am currently pulling the rows for each student from the respective CSV files, concatenating them, and saving one CSV per student.
My current code is:
import pandas as pd

students = ['Student X', 'Student Y', 'Student Z']

df1 = pd.read_csv('Student_stat.csv', encoding='unicode_escape', low_memory=False)
df2 = pd.read_csv('Student_Matches.csv', encoding='unicode_escape', low_memory=False)
df4 = pd.read_csv('Student_Exams.csv', encoding='unicode_escape', low_memory=False)

for student in students:
    # Rows belonging to this student in each source file
    df3 = df1.loc[df1['Name'].str.contains(student)]
    df5 = df4.loc[df4['Name'].str.contains(student)]
    # The frames have different columns, so concatenating them mixes the headers
    df_dataname = pd.concat([df3, df5])
As you can see, this creates problems with the table headers, because each source has different columns.
How can I save the data so that it is easy to load again later?
CodePudding user response:
You can use pandas: a DataFrame has a to_parquet method that writes the data to a Parquet file.
The documentation is here: https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html
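As a rough sketch of how this could apply to the question: instead of one table per student, keep one table per kind of data and add a student column to each. The column name `Student`, the helper names `save_tables` and `rows_for`, and the sample rows are assumptions made for illustration. Note that `to_parquet` needs a Parquet engine such as pyarrow or fastparquet to be installed.

```python
import pandas as pd

# Sample rows based on the question's data; the "Student" column is an assumption.
exams = pd.DataFrame(
    {
        "Student": ["Student X", "Student X", "Student X", "Student Z", "Student Z"],
        "Exam": ["English", "Math", "Geology", "Physics", "Math"],
        "Marks": [120, 159, 105, 10, 159],
        "Comment": ["Good writing", "Excellent", "need better", "bad writing", "Excellent"],
    }
)
batting = pd.DataFrame(
    {
        "Student": ["Student X"] * 3 + ["Student Y"] * 3,
        "Match no": [13, 22, 31, 56, 65, 29],
        "Run": [120, 19, 15, 60, 19, 15],
        "Dismissal": ["Bowled", "stamping", "Bowled", "stamping", "stamping", "Bowled"],
    }
)

# One table per kind of data, so every table keeps a single, fixed header.
tables = {"exams": exams, "batting": batting}


def save_tables(tables, folder="."):
    """Write each table to <folder>/<name>.parquet (hypothetical helper)."""
    for name, table in tables.items():
        table.to_parquet(f"{folder}/{name}.parquet", index=False)


def rows_for(table, student):
    """Return only the rows that belong to one student."""
    return table.loc[table["Student"] == student]


# Counting exams and summing runs per student becomes a groupby.
exams_per_student = exams.groupby("Student").size()
runs_per_student = batting.groupby("Student")["Run"].sum()
```

Calling `save_tables(tables)` would then write `exams.parquet` and `batting.parquet`, and `pd.read_parquet("exams.parquet")` reads a table back. Bowling data and file references could be handled the same way, each as its own table.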
CodePudding user response:
You can use the pandas DataFrame.to_pickle() method.
This saves your data in a pickle file, which can easily be read back into a DataFrame with pandas.read_pickle(filepath_or_buffer):
import pandas as pd
from pandas.testing import assert_frame_equal
df = pd.DataFrame({'Day': [1,2,3,4,5,6,7], 'Temp':[10,12,14,16,18,19,20]})
df.to_pickle('MyPickle.pkl')
df2 = pd.read_pickle('MyPickle.pkl')
assert_frame_equal(df, df2)
CodePudding user response:
The best way is a dictionary-based approach.
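The answer gives no details, so the following is only one possible reading of it: a nested dictionary with one key per student and one sub-key per kind of data. The key names (`details`, `exams`, `batting`) and the helper `exam_count` are assumptions for illustration, and the values are sample rows taken from the question. The sketch uses only the standard library and round-trips the structure through JSON.

```python
import json

# Hypothetical layout: student name -> kind of data -> records.
students = {
    "Student X": {
        "details": {"Age": 35, "Home": "Argentina", "Height": 134},
        "exams": [
            {"Exam": "English", "Marks": 120, "Comment": "Good writing"},
            {"Exam": "Geology", "Marks": 105, "Comment": "need better"},
        ],
        "batting": [
            {"Match no": 13, "Run": 120, "Dismissal": "Bowled"},
            {"Match no": 22, "Run": 19, "Dismissal": "stamping"},
        ],
    },
    "Student Y": {
        "details": {"Age": 34, "Home": "Brazil", "Height": 134},
        "exams": [],  # Student Y has no exam rows in the question
        "batting": [
            {"Match no": 56, "Run": 60, "Dismissal": "stamping"},
        ],
    },
}


def exam_count(data, student):
    """Number of exams recorded for one student (0 if none are stored)."""
    return len(data[student].get("exams", []))


# Serialize to JSON text and load it back; the structure survives unchanged.
text = json.dumps(students, indent=2)
restored = json.loads(text)
```

To keep the data on disk, `json.dump(students, f)` and `json.load(f)` do the same with a file object `f`. Because each student carries only the sections that exist for them, differing table structures per student are not a problem in this layout.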