I have tried multiple methods that get me to a point close to but not exactly where I want to be with the final output. I am trying to first create a few columns that contain a specific within the raw dataframe based on it's position, afterwards I am trying to make a particular row the header row and skip all the rows that were above it.
Raw input:
| NA | NA_1 | NA_2 | NA_3 |
0 | 12-Month Percent Change | NaN | NaN | NaN |
1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN |
2 | Item: Food | NaN | NaN | NaN |
3 | Year | Jan | Feb | Mar |
4 | 2010 | -0.4 | -0.2 | 0.2 |
5 | 2011 | 1.8 | 2.3 | 2.9 |
Code used:
df1['View Description'] = df1.iat[0,0]
df1['Series ID'] = df1.iat[1,1]
df1['Series Name'] = df1.iat[2,1]
df1
Resulted to:
NA NA.1 NA.2 NA.3 NA.4 NA.5 NA.6 NA.7 View Description Series ID Series Name
0 12-Month Percent Change NaN NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
1 Series Id: CUUR0000SAF1 NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
2 Item: Food NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food
3 Year Jan Feb Mar Apr May Jun Jul 12-Month Percent Change CUUR0000SAF1 Food
4 2010 -0.4 -0.2 0.2 0.5 0.7 0.7 0.9 12-Month Percent Change CUUR0000SAF1 Food
5 2011 1.8 2.3 2.9 3.2 3.5 3.7 4.2 12-Month Percent Change CUUR0000SAF1 Food
6 2012 4.4 3.9 3.3 3.1 2.8 2.7 2.3 12-Month Percent Change CUUR0000SAF1 Food
7 2013 1.6 1.6 1.5 1.5 1.4 1.4 1.4 12-Month Percent Change CUUR0000SAF1 Food
Last thing is I want to make the header the row 3 and remove all the rows above it. BUT still keep the three columns at the end: 1) View Description, Series ID, Series Name.
Any suggestions with an efficient way that this can be done as next I want to scale it up with a for loop or something that would do this process for x10 files.
Thanks in advance!
CodePudding user response:
Here's a way to do what I believe your question is asking:
# Parse and store the first 3 values in column 0 so that we can use them
# as values for 3 new columns later.
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
# Transpose so that we can use set_index() to replace the index
# (the columns from the original df1) to ['Item: Food', NaN, NaN, NaN],
# then transpose back so that the new index becomes the columns.
df1 = df1.T.set_index(3).T
# Use reset_index() to replace the index with a fresh range
# index (0, 1, 2, ...) so we can use iloc() to discard the
# first 3 unwanted rows, then call reset_index() again.
df1 = df1.reset_index(drop=True).iloc[3:].reset_index(drop=True)
# Get rid of vestigial name for columns.
df1.columns.names = [None]
# Add the three new columns set to the values saved earlier.
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
Here is full test case (with the above annotated code compressed into fewer lines):
import pandas as pd
s = [
' | NA | NA_1 | NA_2 | NA_3 |',
'0 | 12-Month Percent Change | NaN | NaN | NaN |',
'1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN |',
'2 | Item: Food | NaN | NaN | NaN |',
'3 | Year | Jan | Feb | Mar |',
'4 | 2010 | -0.4 | -0.2 | 0.2 |',
'5 | 2011 | 1.8 | 2.3 | 2.9 |']
df1 = pd.DataFrame(
[[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()]
df1 = df1.T.set_index(3).T.reset_index(drop=True).iloc[3:].reset_index(drop=True)
df1.columns.names = [None]
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)
Output:
NA NA_1 NA_2 NA_3
0 12-Month Percent Change NaN NaN NaN
1 Series Id: CUUR0000SAF1 NaN NaN NaN
2 Item: Food NaN NaN NaN
3 Year Jan Feb Mar
4 2010 -0.4 -0.2 0.2
5 2011 1.8 2.3 2.9
Year Jan Feb Mar View Description Series ID Series Name
0 2010 -0.4 -0.2 0.2 12-Month Percent Change CUUR0000SAF1 Food
1 2011 1.8 2.3 2.9 12-Month Percent Change CUUR0000SAF1 Food
UPDATE: This is code that allows us to configure (1) the cell coordinates of each of 3 cells to be used for new column values (new_col_coords
) and (2) the header_row
above which rows are discarded:
import pandas as pd
s = [
' | NA | NA_1 | NA_2 | NA_3 |',
'0 | 12-Month Percent Change | NaN | NaN | NaN |',
'91 | To be discarded | NaN | NaN | NaN |',
'1 | Series Id: CUUR0000SAF1 | Abc | NaN | NaN |',
'92 | To be discarded | NaN | NaN | NaN |',
'93 | To be discarded | NaN | NaN | NaN |',
'94 | To be discarded | NaN | NaN | NaN |',
'2 | Item: Food | Xyz | NaN | NaN |',
'95 | To be discarded | NaN | NaN | NaN |',
'96 | To be discarded | NaN | NaN | NaN |',
'97 | To be discarded | NaN | NaN | NaN |',
'98 | To be discarded | NaN | NaN | NaN |',
'3 | Year | Jan | Feb | Mar |',
'4 | 2010 | -0.4 | -0.2 | 0.2 |',
'5 | 2011 | 1.8 | 2.3 | 2.9 |']
df1 = pd.DataFrame(
[[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]],
columns = [x.strip() for x in s[0].split('|')[1:-1]],
)
print(df1)
# parse and store the 3 values at specified coordinates so that we can use them as values for 3 new columns later
new_col_coords = [[0,0], [2,1], [6,1]]
new_columns = [x.split(':')[-1].strip() for x in [df1.iloc[i, j] for i, j in new_col_coords]]
header_row = 11
# Here's how to do everything that follows in one line of code:
#df1 = df1.T.set_index(header_row).T.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)
# Transpose so that we can use set_index() to change the index to ['Item: Food', NaN, NaN, NaN], then transpose back so that index becomes the columns
df1 = df1.T.set_index(header_row).T
# Use reset_index() to replace the index with a fresh range index (0, 1, 2, ...) so we can use iloc() to discard the unwanted rows above header_row, then call reset_index() again
df1 = df1.reset_index(drop=True).iloc[header_row:].reset_index(drop=True)
# Get rid of vestigial name for columns
df1.columns.names = [None]
# Add the three new columns set to the values saved earlier
df1[['View Description', 'Series ID', 'Series Name']] = new_columns
print(df1)
Output:
NA NA_1 NA_2 NA_3
0 12-Month Percent Change NaN NaN NaN
1 To be discarded NaN NaN NaN
2 Series Id: CUUR0000SAF1 Abc NaN NaN
3 To be discarded NaN NaN NaN
4 To be discarded NaN NaN NaN
5 To be discarded NaN NaN NaN
6 Item: Food Xyz NaN NaN
7 To be discarded NaN NaN NaN
8 To be discarded NaN NaN NaN
9 To be discarded NaN NaN NaN
10 To be discarded NaN NaN NaN
11 Year Jan Feb Mar
12 2010 -0.4 -0.2 0.2
13 2011 1.8 2.3 2.9
Year Jan Feb Mar View Description Series ID Series Name
0 2010 -0.4 -0.2 0.2 12-Month Percent Change Abc Xyz
1 2011 1.8 2.3 2.9 12-Month Percent Change Abc Xyz