I have a Spreadsheet of Time series Data
It has separate tables for each year with a single row gap in between. I want to have the Year from table header as part of the date column. So that I can plot charts and do simple comparisons of the data (YoY) etc
import pandas as pd
RigCountWorld_df = pd.read_excel(open('Worldwide Rig Count Nov 2022.xlsx', 'rb'),
sheet_name='Worldwide_Rigcount',index_col=None, header=6)
RigCountWorld_df
The code I have is no where near helpful. Even the names of pandas operations I need to use will be helpful for me.
I need a continuous table with data from all years. It would make sense to have latest data at the very end.
Even transposing the tables separately and adding them as new columns would make sense (with the column headers containing Year-Month names.
CodePudding user response:
You can use the pandas melt()
function to reshape the data from wide to long format, so that each row contains a single observation. From there, you can use the pandas concat()
function to join the melted dataframes from each sheet together into a single dataframe.
To add the year from the sheet header to the date column, you can use the pandas assign()
function to create a new column with the year from the sheet name. Then, use the pandas datetime.strptime()
function to convert the existing date column into datetime objects with the year from the new column.
Here's an example of how to do it:
import pandas as pd
# create a list of sheet names
sheets = ['Worldwide_Rigcount_2020', 'Worldwide_Rigcount_2021', 'Worldwide_Rigcount_2022']
# create an empty list to store the dataframes
dfs = []
# loop through each sheet
for sheet in sheets:
# read the sheet into a dataframe
df = pd.read_excel('Worldwide Rig Count Nov 2022.xlsx', sheet_name=sheet, index_col=None, header=6)
# extract the year from the sheet name
year = int(sheet.split('_')[-1])
# melt the dataframe from wide to long format
df_melted = df.melt(id_vars=['Date'])
# add the year from the sheet name to the dataframe
df_melted = df_melted.assign(year=year)
# convert the date column to datetime objects
df_melted['Date'] = pd.to_datetime(df_melted['Date'], format='%b %Y', errors='coerce').dt.strftime('%Y-%m-%d') '-' df_melted['year'].astype(str)
# append the dataframe to the list
dfs.append(df_melted)
# concat the list of dataframes into a single dataframe
df_final = pd.concat(dfs)
# print the final dataframe
print(df_final)
CodePudding user response:
Here is a proposition with some of pandas
built-in functions.
import pandas as pd
df = pd.read_excel("Worldwide Rig Count Nov 2022.xlsx",
sheet_name="Worldwide_Rigcount", header=None, usecols="B:K", skiprows=6)
df.dropna(how="all", inplace=True)
df.insert(0, "Year", np.where(df[10].eq("Total World"), df[1], None))
df["Year"].ffill(inplace=True)
df.drop_duplicates(subset= df.columns[2:], inplace=True)
df.columns = ["Year", "Month"] df.loc[0, 2:].tolist()
df = df.loc[1:, :].reset_index(drop=True)
# Output :
print(df.sample(5).to_string())
Year Month Latin America Europe Africa Middle East Asia Pacific Total Intl. Canada U.S. Total World
613 1975 Mar 310 113 122 173 208 926 192 1651 2769
588 1977 Apr 324 135 165 185 167 976 129 1907 3012
596 1977 Dec 353 142 172 195 182 1044 259 2141 3444
221 2005 Jan 307 57 50 242 204 860 550 1255 2665
566 1979 Aug 440 149 199 144 219 1151 376 2222 3749
# Check :
48 years with 13 rows (12 months Average) for each year.
print(df.groupby("Year").size().value_counts())
13 48
dtype: int64
CodePudding user response:
Another solution validating year and month content (it is assumed, that the column names are in the first row of RigCountWorld_df
):
df = RigCountWorld_df.copy()
first_col = 2 # First column with data
column_names = df.iloc[0, first_col:].to_list()
df["Year"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).str.match(r"^20\d\d$"), None).ffill()
df["Month"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).isin(("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")), None)
df = df[df['Month'].notna()]
df = df.iloc[:, first_col:].set_index(["Year", "Month"])
df.columns = column_names
df