How to extract all the cells between two other cells in Pandas from the column with most cells betwe-CodePudding

I have multiple financial statements and not all of them have the same entries - some have more entries than others - and I'd like to consolidate them all into a single one that has all of the entries.

I was able to do it manually in Excel since it's not that many, but I'd like to have the computer do it to double check I got them all.

So, here's what I did: I created a dataframe where each column has the entry names from one of the financial statements.

FinancialStatement1	FinancialStatement2	FinancialStatement3
REVENUES	REVENUES	REVENUES
Revenue1	Revenue1	Revenue1
Revenue2	Revenue2	Revenue2
EXPENSES	EXPENSES	Revenue3
Expense1	Expense1	EXPENSES
Expense2	PROFIT	Expense1
Expense3	-	Expense2
PROFIT	-	PROFIT
-	-	-

My idea was to run a script that would analyse the number of cells between two of the group titles and return to a 'Consolidated' column all of the strings between these two values, including the first, but not the last.

My end result would look like this:

FinancialStatement1	FinancialStatement2	FinancialStatement3	Consolidated
REVENUES	REVENUES	REVENUES	REVENUES
Revenue1	Revenue1	Revenue1	Revenue1
Revenue2	Revenue2	Revenue2	Revenue2
EXPENSES	EXPENSES	Revenue3	Revenue3
Expense1	Expense1	EXPENSES	EXPENSES
Expense2	PROFIT	Expense1	Expense1
Expense3	-	Expense2	Expense2
PROFIT	-	PROFIT	Expense3
-	-	-	PROFIT

I'm a beginner in Pandas and, so far, here's what I came up with by searching here in stack:

df = pd.read_excel(file)
df['Consolidated']=0
df.head()

df['Consolidated'].iloc[1] = df['FinancialStatement1'][df['FinancialStatement1'].between(
    'REVENUES', 'EXPENSES',
    inclusive=False
    )].tolist()

However, this code gives me "A value is trying to be set on a copy of a slice from a DataFrame". I tried using only df.iloc[3,0] but it also doesn't work. Anyway, this code won't do what I want anyway since it doesn't pick the list with most items between the two group titles.

CodePudding user response：

You can create a SET of all the entries in the column. Python SET will eliminate duplicates. Then you cast to a list and create a pd with the superset of all the entries and join it back to the dataframe.

s = set()
for col in df.columns:
    s |= set(df[col])
l = sorted(list(s - set(['-'])))

df1 = pd.DataFrame(l,columns = ['Consolidated'])
df = df.join(df1,how='outer').replace(np.nan,'-')

	FinancialStatement1	FinancialStatement2	FinancialStatement3	Consolidated
0	REVENUES	REVENUES	REVENUES	EXPENSES
1	Revenue1	Revenue1	Revenue1	Expense1
2	Revenue2	Revenue2	Revenue2	Expense2
3	EXPENSES	EXPENSES	Revenue3	Expense3
4	Expense1	Expense1	EXPENSES	PROFIT
5	Expense2	PROFIT	Expense1	REVENUES
6	Expense3	-	Expense2	Revenue1
7	PROFIT	-	PROFIT	Revenue2
8	-	-	-	Revenue3

CodePudding user response：

If you're looking to just get a list of all line items needed to cover all financial statments, you can do the following

l = list(set(df.unstack().values.tolist()))
l.sort()
l.remove('-')
print(l)

['EXPENSES', 'Expense1', 'Expense2', 'Expense3', 'PROFIT', 'REVENUES', 'Revenue1', 'Revenue2', 'Revenue3']

If you want to just pull out the needed Revenues:

revs = [val for val in l if 'Revenue' in val]
revs.sort()
print(revs)

['Revenue1', 'Revenue2', 'Revenue3']

If you want to just pull out the needed Expenses:

exps = [val for val in l if 'Expense' in val]
exps.sort()
print(exps)

['Expense1', 'Expense2', 'Expense3']

You can piece together the data how you want an add the column to existing df or create a new one

line_items =['REVENUES']   revs   ['EXPENSES']   exps   ['PROFIT']
df1 = pd.DataFrame({'consolidated':line_items})
df_final = pd.concat([df,df1], axis=1, ignore_index=True)
df_final.columns = ['FinancialStatement1', 'FinancialStatement2', 'FinancialStatement3', 'consollidated']
df_final

  FinancialStatement1 FinancialStatement2 FinancialStatement3 consollidated
0            REVENUES            REVENUES            REVENUES      REVENUES
1            Revenue1            Revenue1            Revenue1      Revenue1
2            Revenue2            Revenue2            Revenue2      Revenue2
3            EXPENSES            EXPENSES            Revenue3      Revenue3
4            Expense1            Expense1            EXPENSES      EXPENSES
5            Expense2              PROFIT            Expense1      Expense1
6            Expense3                   -            Expense2      Expense2
7              PROFIT                   -              PROFIT      Expense3
8                 NaN                 NaN                 NaN        PROFIT