Home > Enterprise >  Convert awkwardly formatted data into a dataframe with pandas python
Convert awkwardly formatted data into a dataframe with pandas python

Time:05-06

I have some data in the following csv format:

Variable 1
Time Value
Time1 12
Time2 32
Time3 4
Time4 5
Time5 34
Time6 5
Time7 46
Time8 7
Time9 8
Time10 543
Variable 2
Time Value
Time1 1 2 3
Time2 2 45 5
Time3 4 2 54
Time4 3 1 2
Time5 3 2 4
Time6 4 5 8
Time7 4 7 4
Time8 8 65 12
Time9 12 8 14
Time10 65 65 13
Variable 3
Time Value
Time1 3
Time2 4
Time3 5
Time4 2
Time5 1
Time6 7
Time7 5
Time8 3
Time9 5
Time10 7

And want to put it in the following data frame format with pandas:

         Variable1    Variable2    Variable3    
Time1           12     [1,2,3]             3
Time2           32    [2,45,5]             4
Time3            4    [4,2,54]             5

How would I even go about this? I know the format is awful, don't ask me why it's like that but I'm kinda stuck with it. I really don't even know where to begin with this one. TIA

CodePudding user response:

updated code as per comments

initial file read is based on code from here:

import pandas as pd
import numpy as np

file = r'C:\Test\TIMBER-1100-10M.csv'
# Loop the data lines
with open(file, 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split(",")) for l in temp_f.readlines() ]

# Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]
df = pd.read_csv(file, header=None, delimiter=",", names=column_names)

# preparing dataframe for pivot
df['Variable'] = np.where(df[0].str.contains('VARIABLE:'), df[0], np.nan)
df['Variable'].ffill(inplace=True)
df[1].dropna(inplace=True)
drop_values = ['Timestamp','VARIABLE:']
df2 = df[~df[0].str.contains('|'.join(drop_values))].astype({col: str for col in df.columns[2:]})

conc_col = df2.columns.to_list()
conc_col.remove(0)
conc_col.remove('Variable')

df2['Value'] = df2[conc_col].apply(lambda x: ','.join(x.dropna()), axis=1).str.strip(',nan')
df2.rename(columns={ df.columns[0]: "Time" }, inplace = True)

# creating the pivot as final dataframe
pivot = df2.pivot_table(index=['Time'],
                        columns=['Variable'],
                        values='Value',
                        aggfunc='sum')\
                        .rename_axis(None, axis=1)\
                        .reset_index()
pivot.to_excel(r'C:\Test\temp1.xlsx')
  • Related