Home > Net >  How to extract all the cells between two other cells in Pandas from the column with most cells betwe
How to extract all the cells between two other cells in Pandas from the column with most cells betwe

Time:10-28

I have multiple financial statements and not all of them have the same entries - some have more entries than others - and I'd like to consolidate them all into a single one that has all of the entries.

I was able to do it manually in Excel since it's not that many, but I'd like to have the computer do it to double check I got them all.

So, here's what I did: I created a dataframe where each column has the entry names from one of the financial statements.

FinancialStatement1 FinancialStatement2 FinancialStatement3
REVENUES REVENUES REVENUES
Revenue1 Revenue1 Revenue1
Revenue2 Revenue2 Revenue2
EXPENSES EXPENSES Revenue3
Expense1 Expense1 EXPENSES
Expense2 PROFIT Expense1
Expense3 - Expense2
PROFIT - PROFIT
- - -

My idea was to run a script that would analyse the number of cells between two of the group titles and return to a 'Consolidated' column all of the strings between these two values, including the first, but not the last.

My end result would look like this:

FinancialStatement1 FinancialStatement2 FinancialStatement3 Consolidated
REVENUES REVENUES REVENUES REVENUES
Revenue1 Revenue1 Revenue1 Revenue1
Revenue2 Revenue2 Revenue2 Revenue2
EXPENSES EXPENSES Revenue3 Revenue3
Expense1 Expense1 EXPENSES EXPENSES
Expense2 PROFIT Expense1 Expense1
Expense3 - Expense2 Expense2
PROFIT - PROFIT Expense3
- - - PROFIT

I'm a beginner in Pandas and, so far, here's what I came up with by searching here in stack:

df = pd.read_excel(file)
df['Consolidated']=0
df.head()

df['Consolidated'].iloc[1] = df['FinancialStatement1'][df['FinancialStatement1'].between(
    'REVENUES', 'EXPENSES',
    inclusive=False
    )].tolist()

However, this code gives me "A value is trying to be set on a copy of a slice from a DataFrame". I tried using only df.iloc[3,0] but it also doesn't work. Anyway, this code won't do what I want anyway since it doesn't pick the list with most items between the two group titles.

CodePudding user response:

You can create a SET of all the entries in the column. Python SET will eliminate duplicates. Then you cast to a list and create a pd with the superset of all the entries and join it back to the dataframe.

s = set()
for col in df.columns:
    s |= set(df[col])
l = sorted(list(s - set(['-'])))

df1 = pd.DataFrame(l,columns = ['Consolidated'])
df = df.join(df1,how='outer').replace(np.nan,'-')
FinancialStatement1 FinancialStatement2 FinancialStatement3 Consolidated
0 REVENUES REVENUES REVENUES EXPENSES
1 Revenue1 Revenue1 Revenue1 Expense1
2 Revenue2 Revenue2 Revenue2 Expense2
3 EXPENSES EXPENSES Revenue3 Expense3
4 Expense1 Expense1 EXPENSES PROFIT
5 Expense2 PROFIT Expense1 REVENUES
6 Expense3 - Expense2 Revenue1
7 PROFIT - PROFIT Revenue2
8 - - - Revenue3

CodePudding user response:

If you're looking to just get a list of all line items needed to cover all financial statments, you can do the following

l = list(set(df.unstack().values.tolist()))
l.sort()
l.remove('-')
print(l)

['EXPENSES', 'Expense1', 'Expense2', 'Expense3', 'PROFIT', 'REVENUES', 'Revenue1', 'Revenue2', 'Revenue3']

If you want to just pull out the needed Revenues:

revs = [val for val in l if 'Revenue' in val]
revs.sort()
print(revs)

['Revenue1', 'Revenue2', 'Revenue3']

If you want to just pull out the needed Expenses:

exps = [val for val in l if 'Expense' in val]
exps.sort()
print(exps)

['Expense1', 'Expense2', 'Expense3']

You can piece together the data how you want an add the column to existing df or create a new one

line_items =['REVENUES']   revs   ['EXPENSES']   exps   ['PROFIT']
df1 = pd.DataFrame({'consolidated':line_items})
df_final = pd.concat([df,df1], axis=1, ignore_index=True)
df_final.columns = ['FinancialStatement1', 'FinancialStatement2', 'FinancialStatement3', 'consollidated']
df_final

  FinancialStatement1 FinancialStatement2 FinancialStatement3 consollidated
0            REVENUES            REVENUES            REVENUES      REVENUES
1            Revenue1            Revenue1            Revenue1      Revenue1
2            Revenue2            Revenue2            Revenue2      Revenue2
3            EXPENSES            EXPENSES            Revenue3      Revenue3
4            Expense1            Expense1            EXPENSES      EXPENSES
5            Expense2              PROFIT            Expense1      Expense1
6            Expense3                   -            Expense2      Expense2
7              PROFIT                   -              PROFIT      Expense3
8                 NaN                 NaN                 NaN        PROFIT
  • Related