What are the best ways to save the data for my data science project?

What are the best ways to save the data for my data science project?

**EDIT: The main question I am facing: can I store the data in a single table?

I presume not, since the table structure is different for each student. I also need to count the number of exams each student has taken and the results they got, and likewise for matches.**

Data source 1 (currently stored as a list of strings). I am collecting this data from a large CSV file of student details.

Student X
Age, 35
Home, Argentina
Height, 134

Student Y
Age, 34
Home, Brazil
Height, 134

Student Z
Age, 24
Home, India
Height, 134

Study stats

Student X
Exam,Marks,Comment,
English, 120, Good writing, 
Math, 159, Excellent,
Geology, 105, need better,


Student Z
Exam,Marks,Comment,
Physics, 10, bad writing, 
Math, 159, Excellent,
exercise, 145, need better,

Game stats

Batting data

Student X
Match no,Run,Dismissal,
Match 13, 120, Bowled, 
Match 22, 19, stamping,
Match 31, 15, Bowled,

Student Y
Match no,Run,Dismissal,
Match 56, 60, stamping, 
Match 65, 19, stamping,
Match 29, 15, Bowled,

Bowling data

Student X
Match no,Ball, Run rate, Type of Bowling, Description,
Match 43, 120, 7.5,Fast, Bowled fast; Pace High; Yorker; 
Match 42, 19, 48.5,Spin, Bowled off break; Yorker; 
Match 41, 15, 38.5,Fast, Yorker; Bowled slow;


Student Z
Bowling
Match no,Ball, Run rate, Type of Bowling, Description,
Match 51, 60, 9.4, Fast, Bowled fast; Pace High; Yorker; 
Match 40, 48, 92.2, Fast, Yorker; Bowled slow;

The descriptions are long text (about 200 words each), and I will need them later.

PDF files and picture files (I have not done anything with these files yet)

Student X

Paper 1, file 1242
Picture 45, file 14950


Student Y

Paper 145, file 14893
Picture 45049, file 2048

Now the problem: I am currently getting the data for each specific student from the respective CSVs, concatenating it into one table, and saving a CSV for each student.

My current code is:

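import pandas as pd

students = ['Student X', 'Student Y', 'Student Z']

# Load each source CSV once (this assumes every file has a 'Name' column).
stats = pd.read_csv('Student_stat.csv', encoding='unicode_escape', low_memory=False)
matches = pd.read_csv('Student_Matches.csv', encoding='unicode_escape', low_memory=False)
exams = pd.read_csv('Student_Exams.csv', encoding='unicode_escape', low_memory=False)

for student in students:
    # Select this student's rows from each source.
    student_stats = stats.loc[stats['Name'].str.contains(student)]
    student_matches = matches.loc[matches['Name'].str.contains(student)]
    student_exams = exams.loc[exams['Name'].str.contains(student)]
    # Concatenating frames with different columns is what mixes up the headers.
    df_dataname = pd.concat([student_stats, student_matches, student_exams])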

As you can see, this creates a problem with the table headers.

How can I save the data so that it is easy to load and query later?

CodePudding user response：

You can use pandas: a DataFrame has a to_parquet method that writes the data to a Parquet file.

The documentation is here: https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html
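
As a rough sketch of how this could look: instead of one file per student, each kind of data (exams, batting, and so on) could be kept as one long table with a Student column, so every table has a single consistent header. The column names, the sample rows (copied from the question), and the helper names build_exam_table and save_tables are assumptions made for illustration. Note that to_parquet needs a Parquet engine such as pyarrow or fastparquet to be installed.

```python
from pathlib import Path

import pandas as pd


def build_exam_table() -> pd.DataFrame:
    """Return a long-format exam table (sample rows taken from the question)."""
    rows = [
        ("Student X", "English", 120),
        ("Student X", "Math", 159),
        ("Student X", "Geology", 105),
        ("Student Z", "Physics", 10),
        ("Student Z", "Math", 159),
        ("Student Z", "exercise", 145),
    ]
    return pd.DataFrame(rows, columns=["Student", "Exam", "Marks"])


def save_tables(tables: dict, out_dir: str) -> None:
    """Write each table to <out_dir>/<name>.parquet (needs pyarrow or fastparquet)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, table in tables.items():
        table.to_parquet(out / f"{name}.parquet", index=False)


exams = build_exam_table()

# Number of exams per student: one row per exam, so the group size is the count.
exam_counts = exams.groupby("Student").size()

# All rows for one student, selected by exact name rather than str.contains.
student_x = exams[exams["Student"] == "Student X"]
```

Calling save_tables({"exams": exams}, "student_data") would then write student_data/exams.parquet, which pd.read_parquet can load back. The batting and bowling data could be handled the same way, each as its own table with its own columns.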

CodePudding user response：

You can use the pandas pd.DataFrame.to_pickle() method.

This saves your data in a pickle file, which can easily be read back into a DataFrame with pandas.read_pickle(filepath_or_buffer).

import pandas as pd
from pandas.testing import assert_frame_equal

df = pd.DataFrame({'Day': [1,2,3,4,5,6,7], 'Temp':[10,12,14,16,18,19,20]})
df.to_pickle('MyPickle.pkl')
df2 = pd.read_pickle('MyPickle.pkl')

assert_frame_equal(df, df2)

CodePudding user response：

The best way is a dictionary-based approach.
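
One way to read this suggestion, sketched under assumptions: keep a nested dictionary keyed by student name, with one entry per kind of data, and save it as JSON. The key names ("profile", "exams", "batting"), the helpers count_records, save_students and load_students, and the idea of using JSON are illustrative choices, not part of the original answer. The sample values are copied from the question.

```python
import json

# Nested dictionary keyed by student name; each kind of data keeps its own fields.
students = {
    "Student X": {
        "profile": {"Age": 35, "Height": 134},
        "exams": [
            {"Exam": "English", "Marks": 120},
            {"Exam": "Math", "Marks": 159},
            {"Exam": "Geology", "Marks": 105},
        ],
        "batting": [
            {"Match": 13, "Run": 120},
            {"Match": 22, "Run": 19},
            {"Match": 31, "Run": 15},
        ],
    },
    "Student Y": {
        "profile": {"Age": 34, "Height": 134},
        "exams": [],  # the question lists no exam data for Student Y
        "batting": [
            {"Match": 56, "Run": 60},
            {"Match": 65, "Run": 19},
            {"Match": 29, "Run": 15},
        ],
    },
}


def count_records(data: dict, kind: str) -> dict:
    """Return {student: number of records of the given kind}."""
    return {name: len(info.get(kind, [])) for name, info in data.items()}


def save_students(data: dict, path: str) -> None:
    """Write the whole structure to a single JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)


def load_students(path: str) -> dict:
    """Read the structure back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


exam_counts = count_records(students, "exams")
match_counts = count_records(students, "batting")
```

Because each student's entry can hold different fields, this avoids the mismatched-header problem, at the cost of losing the column-wise operations a DataFrame gives you.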