I have two excel sheets with multiple rows and columns. My task is to compare both excels and print only the matching values. The output has to be print into a new excel. My idea is to use pandas and xlsxwriter engine for this.
Pseudocode:
- Read Excel 1 --> Dataframe 1
- Read Excel 2 --> Dataframe 2
- Filter Excel 2 so that it contains only entries with a particular text "Service_" --> Dataframe 3
- Compare Excel 1 with Filtered Excel 2 (Dataframe 1 with Dataframe 3)
- Print only matching elements from both excels (Dataframe 4)
- Store the output in a new excel with two sheets (Dataframe 3 and Dataframe 4)
The issue I have is Dataframe 3 is printed correctly but Dataframe 4 has some missing values
Dataframe 1
ColumnAlpha
0 Service_1
1 Service_2
2 Service_3
3 Service_4
4 Service_5
5 Service_6
Dataframe 2
ColumnA ColumnB ColumnC
0 Service_1 100 Text1
1 Service_2 110 Text2
2 Sample1 120 Text3
3 Sample2 130 Text4
4 Service_6 140 Text5
5 Service_7 150 Text6
Code:
#Reading Excel 1
data1 = pd.read_excel(r'C:\Users\XXXX\Excel1.xlsx')
df1 = pd.DataFrame(data1, columns= ['ColumnAlpha'])
#Reading Excel 2
data2 = pd.read_excel(r'C:\Users\XXXX\Excel2.xlsx')
df2 = pd.DataFrame(data2, columns= ['ColumnA','ColumnB','ColumnC'])
# Filtering ColumnA containing the text Service_ and sorting
filter_df2 = df2.loc[df2['ColumnA'].str.contains("Service_", case = False)]
clean_df2 = filter_df2.groupby("ColumnA").first().reset_index()
clean_df2.sort_values(by='ColumnB', inplace=True, ascending=[False])
#Comparing ColumnA with ColumnAlpha to filter only matching texts
MatchedData = pd.DataFrame(columns=['ColumnA', 'ColumnB','ColumnC'])
for i in df1.ColumnAlpha:
match_df = clean_df2[clean_df2.ColumnA.str.contains(i)]
MatchedData = MatchedData.append(match_df, ignore_index=True)
# Class 2 Services Interfaces created in the last one week
MatchedData["ThirdColumn"] = clean_df2["ColumnC"]
OutputData = pd.DataFrame(MatchedData, columns= ['FirstColumn','SecondColumn','ThirdColumn'])
OutputData.sort_values(by='SecondColumn', inplace=True)
# Printing Output in Excel
with pd.ExcelWriter(r'C:\Users\XXXX\Output.xlsx', engine='xlsxwriter') as writer: # pylint: disable=abstract-class-instantiated
# Writing Sheet 1
clean_df2.to_excel(writer, sheet_name="All entries", index=False)
#Writing Sheet 2
OutputData.to_excel(writer, sheet_name="Filtered entries", index=False)
Expected Dataframe 3 (Sheet 1 in the output)
ColumnA ColumnB ColumnC
0 Service_7 150 Text6
1 Service_6 140 Text5
2 Service_2 110 Text2
3 Service_1 100 Text1
Expected Dataframe 4 (Expected Sheet 2 in the output)
FirstColumn SecondColumn ThirdColumn
0 Service_6 140 Text5
1 Service_2 110 Text2
2 Service_1 100 Text1
Actual Dataframe 4 (Actual Sheet 2 in the output with empty cell C2)
FirstColumn SecondColumn ThirdColumn
0 Service_6 140 Text5
1 Service_2 110 Text2
2 Service_1 100
What is going wrong here? I am struggling to find why some random values gets omitted when the same value is print in the first sheet of the excel.
Answers or suggestions would be highly appreciated. Thank you very much.
P.S I tried to simplify the question, code and the dataframes as much as possible. My apologies in advance if it still looks complicated.
CodePudding user response:
it seems to me that this problem is related to different column names in MatchedData and OutputData. i tryed init OutputData as
OutputData = pd.DataFrame(MatchedData, columns=['ColumnA', 'ColumnB', 'ColumnC']) And got the expected result.
ps. If pandas is too complex, then for basic tasks you can use the pure xlsxwriter library (without Pandas dataFrames), and control the data using the base Python language like "dict", "set" and etc.
CodePudding user response:
I think you are making things a bit harder than they need to be. Try the following:
# load df1 and df2 (your code should work here)
# =============================================================================
# filter df2 on "Service_", sort on colB desc, and reset index:
df3 = df2.loc[df2['ColumnA'].str.contains('Service_')].copy()
# if you can be certain of asc order original df col, you could here exchange the
# following line for: df3 = df3[::-1]
df3.sort_values(by='ColumnB', ascending=False, inplace=True)
df3.reset_index(drop=True, inplace=True)
# =============================================================================
# df keeping only matches df3.ColumnA in df1.ColumnAlpha (and reset_index):
df4 = df3.loc[df3['ColumnA'].isin(df1['ColumnAlpha'])].copy()
df4.reset_index(drop=True, inplace=True)
df4.columns = ['FirstColumn','SecondColumn','ThirdColumn']
# =============================================================================
print(df3)
ColumnA ColumnB ColumnC
0 Service_7 150 Text6
1 Service_6 140 Text5
2 Service_2 110 Text2
3 Service_1 100 Text1
print(df4)
FirstColumn SecondColumn ThirdColumn
0 Service_6 140 Text5
1 Service_2 110 Text2
2 Service_1 100 Text1
# write to Excel (your code should work here)