I have a dataframe of price data (with more than 10,000 columns) that looks like the following:
Unamed: 0 | 01973JAC3 corp | Unamed: 2 | 019754AA8 corp | Unamed: 4 | 01265RTJ7 corp | Unamed: 6 | 01988PAD0 corp | Unamed: 8 | 019736AB3 corp | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2004-04-13 | 101.1 | 2008-06-16 | 99.1 | 2010-06-14 | 110.0 | 2008-06-18 | 102.1 | NaT | NaN |
2 | 2004-04-14 | 101.2 | 2008-06-17 | 100.4 | 2010-07-05 | 110.3 | 2008-06-19 | 102.6 | NaT | NaN |
3 | 2004-04-15 | 101.6 | 2008-06-18 | 100.4 | 2010-07-12 | 109.6 | 2008-06-20 | 102.5 | NaT | NaN |
4 | 2004-04-16 | 102.8 | 2008-06-19 | 100.9 | 2010-07-19 | 110.1 | 2008-06-21 | 102.6 | NaT | NaN |
5 | 2004-04-19 | 103.0 | 2008-06-20 | 101.3 | 2010-08-16 | 110.3 | 2008-06-22 | 102.8 | NaT | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | NaT | NaN |
3431 | NaT | NaN | 2021-12-30 | 119.2 | NaT | NaN | NaT | NaN | NaT | NaN |
3432 | NaT | NaN | 2021-12-31 | 119.4 | NaT | NaN | NaT | NaN | NaT | NaN |
(Those are 9-digit CUSIPs in the header, so every pair of columns holds the dates and close prices for one security.) I would like to
- find and get rid of empty date/price pairs such as "Unamed: 8" and "019736AB3 corp"
- then rearrange the dataframe into a panel of monthly close prices like the following:
Date | 01973JAC3 | 019754AA8 | 01265RTJ7 | 01988PAD0 |
---|---|---|---|---|
2004-04-30 | 102.1 | NaN | NaN | NaN |
2004-05-31 | 101.2 | NaN | NaN | NaN |
... | ... | ... | ... | ... |
2021-12-30 | NaN | 119.2 | NaN | NaN |
2021-12-31 | NaN | 119.4 | NaN | NaN |
Edit: I want to clarify my question.
My dataframe has more than 10,000 columns, which makes it impossible to drop columns by name or rename them one by one. The date/price pairs start and end at different times and are of different lengths (and of different frequencies). I'm looking for an efficient way to arrange them into a less messy form. Thanks.
Here is a sample of 30 columns. https://github.com/txd2x/datastore file name: sample-question2022-01.xlsx
I figured it out: stacking and then reshaping. Thanks for the help.
df = pd.DataFrame()
for i in range(len(price.columns) // 2):
    temp = pd.DataFrame(columns=['Date', 'ClosedPrice', 'CUSIP'])
    temp['Date'] = price.iloc[:, 2 * i]
    temp['ClosedPrice'] = price.iloc[:, 2 * i + 1]
    temp['CUSIP'] = price.columns[2 * i + 1][:9]  # keep the 9-digit CUSIP, drop ' corp'
    df = pd.concat([df, temp])  # stack all the column pairs
df = df.dropna(axis=0, how='any')  # drop NaN rows
df = df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')  # Date as index, CUSIPs as columns
df_monthly = df.resample('M').last()  # last price of each month
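For thousands of pairs, building the long frame with one `pd.concat` over all pair slices avoids repeated frame copies. Below is a sketch of the same stack-and-pivot idea on a tiny made-up frame (the data and some column names are illustrative, not from the real file):

```python
import pandas as pd

# Synthetic two-pair frame mimicking the question's layout (hypothetical data).
price = pd.DataFrame({
    'Unamed: 0': pd.to_datetime(['2004-04-13', '2004-04-14', '2004-05-03']),
    '01973JAC3 corp': [101.1, 101.2, 101.6],
    'Unamed: 2': pd.to_datetime(['2008-06-16', '2008-06-17', '2008-07-01']),
    '019754AA8 corp': [99.1, 100.4, 100.9],
})

# Slice every date/price pair, tag it with its CUSIP, and stack in one concat.
pairs = [
    price.iloc[:, [2 * i, 2 * i + 1]]
         .set_axis(['Date', 'ClosedPrice'], axis=1)
         .assign(CUSIP=price.columns[2 * i + 1][:9])
    for i in range(len(price.columns) // 2)
]
long = pd.concat(pairs, ignore_index=True).dropna()

# Reshape to a Date x CUSIP panel and take the last price in each month.
panel = long.pivot(index='Date', columns='CUSIP', values='ClosedPrice')
monthly = panel.resample('M').last()
```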
CodePudding user response:
If you want to get rid of unneeded columns, run the following:
df.drop("name_of_column", axis=1, inplace=True)
If you want to drop empty rows, use:
df.drop(df.index[row_number], inplace=True)
If you want to rearrange the data by timestamp/date, you need to convert the column to a datetime object and then set it as the index:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And you probably want to rename the columns before doing any of the above: df.rename(columns={'first_column': 'first', 'second_column': 'second'}, inplace=True)
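With 10,000+ columns, listing renames by hand is not feasible; the mapping dict can be built programmatically instead. A sketch that strips the " corp" suffix in bulk (the column names are taken from the question's header style):

```python
import pandas as pd

# Hypothetical frame with the question's header style.
df = pd.DataFrame(columns=['Unamed: 0', '01973JAC3 corp',
                           'Unamed: 2', '019754AA8 corp'])

# Build the rename mapping in one pass: keep the 9-digit CUSIP, drop ' corp'.
mapping = {c: c[:9] for c in df.columns if c.endswith(' corp')}
df = df.rename(columns=mapping)
```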
Update 01:
If you want to keep just some of those 10,000 columns, say 7 or 10 of them, use df = df[["first_column", "second_column", ....]]
If you want to get rid of all empty columns, use: df = df.dropna(axis=1, how='all')
"how" keyword have two values: "all" to drop the whole row or column if it is full of Nan, "any" to drop the whole row or column if it have one Nan at least.
Update 02: Now if you have a lot of date columns and you just want to keep one of them, supposing the date column you keep (here 'Date') has no NaN values, use the following code:
columns = df.columns.tolist()
for column in columns:
    try:
        if df[column].dtypes == 'object':
            df[column] = pd.to_datetime(df[column])
        if (df[column].dtypes == 'datetime64[ns]') & (column != 'Date'):
            df.drop(column, axis=1, inplace=True)
    except ValueError:
        pass
Rearrange the dataframe by month:
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df.Date.dt.month
df['Year'] = df.Date.dt.year
monthly = df.groupby(['Year', 'Month']).mean()
Update 03: To combine all date columns while preserving the data, use the following code:
import pandas as pd
import numpy as np
import itertools

df = pd.read_excel('sample_question2022-01.xlsx')

# Drop columns that are mostly empty.
for column in df.columns.tolist():
    if df[column].isnull().sum() > 2300:
        df.drop(column, axis=1, inplace=True)

# Give the pairs uniform names: date1/Price1, date2/Price2, ...
count_date = itertools.count(1)
count_price = itertools.count(1)
for column in df.columns.tolist():
    if df[column].dtypes == 'datetime64[ns]':
        df.rename(columns={column: f'date{next(count_date)}'}, inplace=True)
    else:
        df.rename(columns={column: f'Price{next(count_price)}'}, inplace=True)

# Merge each pair on its date index.
columns = df.columns.tolist()
merged = df[[columns[0], columns[1]]].set_index('date1')
k = 2
for i in range(2, len(columns) - 1, 2):
    merged = pd.merge(merged, df[[columns[i], columns[i + 1]]].set_index(f'date{k}'),
                      how='outer', left_index=True, right_index=True)
    k += 1
The only problem left is that it can throw a memory error:
MemoryError: Unable to allocate 27.4 GiB for an array with shape (3677415706,) and data type int64
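An allocation that large in an index-on-index outer merge is usually a sign of duplicated index values: every repeated date on one side matches every repeated date on the other, so row counts multiply across merges. One possible mitigation (sketched on made-up frames, not the real file) is to keep a single row per date in each pair before merging:

```python
import pandas as pd

# Two hypothetical pair frames; the first has a duplicated date.
left = pd.DataFrame(
    {'Price1': [101.1, 101.2, 101.3]},
    index=pd.to_datetime(['2004-04-13', '2004-04-13', '2004-04-14']))
right = pd.DataFrame(
    {'Price2': [99.1, 100.4]},
    index=pd.to_datetime(['2004-04-13', '2004-04-14']))

# Duplicate index values make outer merges grow multiplicatively;
# keeping one row per date bounds the size of the result.
left = left[~left.index.duplicated(keep='last')]

merged = pd.merge(left, right, how='outer',
                  left_index=True, right_index=True)
```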