How to insert multiple rows of complex data into a dataframe in pandas?


I am quite new to data engineering and want to plot the daily streams for a number of tracks, in order to find a common model that captures how a song's streaming pattern evolves over the years.

I have input data in the form of:

[
    {
        "date": "2021-06-13",
        "streams_total": 1600432
    },
    {
        "date": "2021-06-14",
        "streams_total": 1600432
    },
    ...
]
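I can read a snippet like this into a DataFrame with something like the following (the file name is made up on my end):

import pandas as pd

# hypothetical file holding the records as a JSON array
df = pd.read_json("streams.json", convert_dates=["date"])

# or, if the records are already a Python list of dicts:
records = [
    {"date": "2021-06-13", "streams_total": 1600432},
    {"date": "2021-06-14", "streams_total": 1600432},
]
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])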

It is not daily data since the release of the song; how far back the data goes depends on how new the song is. For some songs I am missing the first 1-2 years of data.

My first task is to read this into a DataFrame using pandas. I am unsure how to structure the data, and also how I can compare multiple songs. One idea I have is to use the release date instead of the date and calculate the number of days since release. I am also thinking of scaling each song so that its streams sum to 1.0. That way I can compare songs that I have different amounts of data for and that have different total stream counts.
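Roughly what I have in mind (the column names and the release date are mine):

import pandas as pd

# df from above: columns "date" (datetime) and "streams_total" for one song;
# the release date would come from song metadata (hypothetical value here)
release_date = pd.Timestamp("2021-06-01")

df["days_since_release"] = (df["date"] - release_date).dt.days

# scale so this song's streams sum to 1.0, making songs comparable
df["streams_scaled"] = df["streams_total"] / df["streams_total"].sum()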

Given this, I am thinking of using the number of days since release as the column headers, with each row being a song. I could set the days I don't have data for to 0 or NaN (see the sketch after the table):


| Day0  | Day1  | Day2   | Day3   | ISRC Code  |
|-------|-------|--------|--------|------------|
| 23231 | 23111 | 19232  | 19233  | USRCAB123B |
| 0     | 0     | 160131 | 159923 | USHDB1232H |
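As a sketch, I think this wide layout could be built from a long-format frame with pivot (all names here are mine):

import pandas as pd

# long format: one row per (isrc, days_since_release) observation
long_df = pd.DataFrame({
    "isrc": ["USRCAB123B", "USRCAB123B", "USHDB1232H"],
    "days_since_release": [0, 1, 2],
    "streams_total": [23231, 23111, 160131],
})

# wide format: one row per song, one column per day since release
wide = long_df.pivot(index="isrc",
                     columns="days_since_release",
                     values="streams_total")
wide = wide.fillna(0)  # days with no data become NaN; fill with 0 if preferred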

Would this be a good idea? I will probably have to clean the data somewhat and remove extreme outliers. How can I do that without shifting values between columns, so the data stays in sync?
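One approach I have seen is clipping each row to its own quantile range, which keeps every value in place instead of dropping rows or columns (a sketch, reusing the wide frame from above):

# clip each song's values to its 1st-99th percentile range; nothing is
# dropped, so rows and columns stay aligned
lower = wide.quantile(0.01, axis=1)
upper = wide.quantile(0.99, axis=1)
wide_clipped = wide.clip(lower=lower, upper=upper, axis=0)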

Would it be possible to do a geometric curve fit to this kind of data?
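For example, I imagine something like fitting an exponential decay with scipy (the model choice is just a guess on my part):

import numpy as np
from scipy.optimize import curve_fit

def decay(day, a, b):
    # simple geometric/exponential decay: streams ~ a * exp(-b * day)
    return a * np.exp(-b * day)

row = wide_clipped.loc["USRCAB123B"]   # one song's daily streams
days = row.index.to_numpy(dtype=float)
streams = row.to_numpy(dtype=float)
mask = streams > 0                     # ignore days with no data

params, _ = curve_fit(decay, days[mask], streams[mask],
                      p0=(streams[mask][0], 0.01))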

CodePudding user response:

As a first remark, I feel you should consider doing this in two steps: storing the data, then processing it. Storing the data in a dataframe is a coherent choice.

When it comes to processing, my second remark would be that indexing data on columns (one column per day, as in your table) is generally not a good idea.

From where I stand, I would:

  1. Store all data in one (or several) dataframe(s)
  2. Use a script to work on these df for comparison purposes

When it comes to the storage itself, you could store everything (song id, release date, date, and streams to that date) in one dataframe, but as your sample data grows you may end up with a massive df and hence massive processing times.

It really depends on what you ultimately want to achieve, but you could for instance have a first dataframe whose columns would be: song_id | song_name | release_date | db_name

Each song would then have its own "db" (i.e. dumped dataframe) file, as referenced in the last column above, that would contain all existing samples (typically date | streams). Appending new samples to it should be trivial.
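A rough sketch of what I mean (file names and sample values are placeholders):

import pandas as pd

# index dataframe: one row per song
catalog = pd.DataFrame({
    "song_id": [1, 2],
    "song_name": ["Song A", "Song B"],
    "release_date": pd.to_datetime(["2021-06-01", "2019-03-15"]),
    "db_name": ["song_1.pkl", "song_2.pkl"],
})

# per-song "db": a dumped dataframe of date | streams samples
samples = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-13", "2021-06-14"]),
    "streams": [1600432, 1600432],
})
samples.to_pickle("song_1.pkl")
samples.to_pickle("song_2.pkl")  # same placeholder data for the second song

# appending new samples later:
db = pd.read_pickle("song_1.pkl")
new = pd.DataFrame({"date": [pd.Timestamp("2021-06-15")], "streams": [1601000]})
db = pd.concat([db, new], ignore_index=True)
db.to_pickle("song_1.pkl")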

Finally, your processing script will just have to look into the db files to be able to compare things...
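For instance (continuing the sketch above):

import pandas as pd

# load each song's db, convert dates to days since release, normalise,
# and collect everything into one frame for comparison
curves = {}
for _, row in catalog.iterrows():
    db = pd.read_pickle(row["db_name"])
    days = (db["date"] - row["release_date"]).dt.days
    curves[row["song_name"]] = pd.Series(
        db["streams"].to_numpy() / db["streams"].sum(), index=days
    )

comparison = pd.DataFrame(curves)  # one column per song, indexed by day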

Once again, it really depends on what you want to achieve: the amount of data, whether the data needs to persist, the number of times you want to run this, etc.
