How to clean and rearrange a dataframe with pairs of date and price columns into a df with common dates


I have a dataframe of price data (with more than 10,000 columns) that looks like the following:

Unamed: 0 01973JAC3 corp Unamed: 2 019754AA8 corp Unamed: 4 01265RTJ7 corp Unamed: 6 01988PAD0 corp Unamed: 8 019736AB3 corp
1 2004-04-13 101.1 2008-06-16 99.1 2010-06-14 110.0 2008-06-18 102.1 NaT NaN
2 2004-04-14 101.2 2008-06-17 100.4 2010-07-05 110.3 2008-06-19 102.6 NaT NaN
3 2004-04-15 101.6 2008-06-18 100.4 2010-07-12 109.6 2008-06-20 102.5 NaT NaN
4 2004-04-16 102.8 2008-06-19 100.9 2010-07-19 110.1 2008-06-21 102.6 NaT NaN
5 2004-04-19 103.0 2008-06-20 101.3 2010-08-16 110.3 2008-06-22 102.8 NaT NaN
... ... ... ... ... ... ... ... ... NaT NaN
3431 NaT NaN 2021-12-30 119.2 NaT NaN NaT NaN NaT NaN
3432 NaT NaN 2021-12-31 119.4 NaT NaN NaT NaN NaT NaN

(Those are 9-digit CUSIPs in the header, so each pair of columns represents the date and closing price for one security.) I would like to

  1. find and get rid of empty pairs of date and price columns like "Unamed: 8" and "019736AB3 corp"
  2. then rearrange the dataframe into a panel of monthly closing prices, as follows:
Date 01973JAC3 019754AA8 01265RTJ7 01988PAD0
2004-04-30 102.1 NaN NaN NaN
2004-05-31 101.2 NaN NaN NaN
... ... ... ... ...
2021-12-30 NaN 119.2 NaN NaN
2021-12-31 NaN 119.4 NaN NaN

Edit: I want to clarify my question.

So my dataframe has more than 10,000 columns, which makes it impossible to drop columns by name or rename them one by one. The pairs of date and price columns start and end at different times, are of different lengths, and have different frequencies. I'm looking for an efficient way to arrange them into a less messy form. Thanks.

Here is a sample of 30 columns: https://github.com/txd2x/datastore (file name: sample-question2022-01.xlsx).
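For step 1 on its own, a minimal sketch (assuming the sample file above loads with the same headers as in the preview, into a frame called price, as in the code further down): dropping all-NaN columns removes empty pairs like "Unamed: 8" / "019736AB3 corp" without naming them one by one.

import pandas as pd

price = pd.read_excel('sample-question2022-01.xlsx')   # the sample file linked above
price = price.dropna(axis=1, how='all')                # drop (date, price) columns that are entirely empty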

I figured it out: stacking and then reshaping. Thanks for the help.

import pandas as pd

df = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])      # collector for the stacked pairs
for i in range(len(price.columns) // 2):
    temp = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])
    temp['Date'] = price.iloc[:-1, 2 * i]                        # date column of the pair
    temp['ClosedPrice'] = price.iloc[:-1, 2 * i + 1]             # price column of the pair
    temp['CUSIP'] = price.columns[2 * i + 1][:9]                 # first 9 characters of the header are the CUSIP
    df = df.append(temp)                                         # stack all the column pairs (pd.concat in newer pandas)

df = df.dropna(axis=0, how='any')                                    # drop NaN rows
df = df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')   # Date as index, CUSIPs as column headers
df_monthly = df.resample('M').last()                                 # last price of each month
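As a self-contained sanity check of the same stack → pivot → resample idea (the two securities and prices below are just an illustrative mock-up of the layout shown above, and pd.concat is used in place of append):

import pandas as pd

# toy frame with the same layout: alternating (date, price) columns, values invented
price = pd.DataFrame({
    'Unamed: 0':      pd.to_datetime(['2004-04-13', '2004-04-30', '2004-05-03']),
    '01973JAC3 corp': [101.1, 102.1, 101.2],
    'Unamed: 2':      pd.to_datetime(['2008-06-16', '2008-06-30', pd.NaT]),
    '019754AA8 corp': [99.1, 100.4, None],
})

frames = []
for i in range(len(price.columns) // 2):
    temp = price.iloc[:, [2 * i, 2 * i + 1]].copy()
    temp.columns = ['Date', 'ClosedPrice']
    temp['CUSIP'] = price.columns[2 * i + 1][:9]     # CUSIP from the price header
    frames.append(temp)

long = pd.concat(frames).dropna()
panel = long.pivot(index='Date', columns='CUSIP', values='ClosedPrice')
print(panel.resample('M').last())                    # month-end panel, one column per CUSIP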

CodePudding user response:

If you want to get rid of unneeded columns, drop them by name:

df.drop("name_of_column", axis=1, inplace=True)

If you want to drop an empty row by its position, use:

df.drop(df.index[row_number], inplace=True)

If you want to rearrange the data by date, you need to convert the date column to a datetime object and set it as the index:

import pandas as pd

df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date')

You probably want to rename the columns before doing any of the above: df.rename(columns={'first_column': 'first', 'second_column': 'second'}, inplace=True)
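For example, applied to the first pair of headers shown in the question (the exact header strings are taken from the preview above, purely as an illustration):

df.rename(columns={'Unamed: 0': 'Date', '01973JAC3 corp': '01973JAC3'}, inplace=True)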

Update 01: if you want to keep just some of those 10,000 columns, say 7 or 10 of them, use df = df[["first_column", "second_column", ...]]
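Since listing 10,000 names by hand is not practical here, the same idea can be written programmatically. A sketch, relying on the fact that each (date, price) pair sits in adjacent columns and that the first 9 characters of the price header are the CUSIP (the two CUSIPs below are just a hypothetical selection):

wanted = {'01973JAC3', '019754AA8'}          # hypothetical CUSIPs to keep
cols = list(df.columns)
keep = []
for i in range(0, len(cols) - 1, 2):         # walk the (date, price) pairs
    if cols[i + 1][:9] in wanted:
        keep += [cols[i], cols[i + 1]]
df = df[keep]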

If you want to get rid of all empty columns, use df.dropna(axis=1, how='all'). The "how" keyword takes two values: "all" drops a row or column only if it is entirely NaN, while "any" drops it if it contains at least one NaN.
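A small illustration of the difference (made-up two-column frame):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
tmp.dropna(axis=1, how='all')    # drops only column 'b' (entirely NaN)
tmp.dropna(axis=1, how='any')    # drops both columns (each contains at least one NaN)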

Update 02: if you have many date columns and you just want to keep one of them (assuming the column you keep, 'Date', has no NaN values), use the following code:

columns = df.columns.tolist()

for column in columns:
    try:
        if df[column].dtypes == 'object':
            df[column] = pd.to_datetime(df[column])           # try to parse text columns as dates
        if (df[column].dtypes == 'datetime64[ns]') and (column != 'Date'):
            df.drop(column, axis=1, inplace=True)             # drop every date column except 'Date'
    except ValueError:
        pass

To rearrange the dataframe by month:

df.Date = pd.to_datetime(df.Date)
df['Month'] = df.Date.dt.month
df['Year'] = df.Date.dt.year
df.groupby(["Year", "Month"]).mean()    # one row per (year, month)

Update 03: to combine all the date columns into one index while preserving the data, use the following code:

import itertools

import pandas as pd

df = pd.read_excel('sample-question2022-01.xlsx')

# drop columns that are mostly empty
columns = df.columns.tolist()
for column in columns:
    if df[column].isnull().sum() > 2300:
        df.drop(column, axis=1, inplace=True)

# rename the remaining columns to date1/Price1, date2/Price2, ...
columns = df.columns.tolist()
count_date = itertools.count(1)
count_price = itertools.count(1)
for column in columns:
    if df[column].dtypes == 'datetime64[ns]':
        df.rename(columns={column: f'date{next(count_date)}'}, inplace=True)
    else:
        df.rename(columns={column: f'Price{next(count_price)}'}, inplace=True)

# outer-merge each (date, price) pair on its date column
columns = df.columns.tolist()
merged = df[[columns[0], columns[1]]].set_index('date1')
k = 2
for i in range(2, len(columns) - 1, 2):
    merged = pd.merge(merged, df[[columns[i], columns[i + 1]]].set_index(f'date{k}'),
                      how='outer', left_index=True, right_index=True)
    k += 1

The only problem left is that it throws a MemoryError:

MemoryError: Unable to allocate 27.4 GiB for an array with shape (3677415706,) and data type int64
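A possible way around the memory blow-up (a sketch, not verified on the full file, assuming the renamed date1/Price1, date2/Price2, ... layout from the code above): instead of chaining pairwise outer merges, stack each pair into long format and pivot once, which is essentially the stacking approach from the question's own solution. The 'Security' helper column is just a label introduced for the pivot, and pivot_table with aggfunc='last' is used to tolerate duplicate dates within a pair.

frames = []
for i in range(0, len(columns) - 1, 2):
    pair = df[[columns[i], columns[i + 1]]].dropna()
    pair.columns = ['Date', 'Price']
    pair['Security'] = columns[i + 1]        # which pair the price came from
    frames.append(pair)

long_df = pd.concat(frames, ignore_index=True)
merged = long_df.pivot_table(index='Date', columns='Security', values='Price', aggfunc='last')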
