Some cells are empty when printing pandas output with xlsxwriter-CodePudding

I have two excel sheets with multiple rows and columns. My task is to compare both excels and print only the matching values. The output has to be print into a new excel. My idea is to use pandas and xlsxwriter engine for this.

Pseudocode:

Read Excel 1 --> Dataframe 1
Read Excel 2 --> Dataframe 2
Filter Excel 2 so that it contains only entries with a particular text "Service_" --> Dataframe 3
Compare Excel 1 with Filtered Excel 2 (Dataframe 1 with Dataframe 3)
Print only matching elements from both excels (Dataframe 4)
Store the output in a new excel with two sheets (Dataframe 3 and Dataframe 4)

The issue I have is Dataframe 3 is printed correctly but Dataframe 4 has some missing values

Dataframe 1

  ColumnAlpha     
0 Service_1   
1 Service_2   
2 Service_3   
3 Service_4   
4 Service_5   
5 Service_6

Dataframe 2

   ColumnA    ColumnB  ColumnC 
0 Service_1    100     Text1
1 Service_2    110     Text2
2 Sample1      120     Text3
3 Sample2      130     Text4
4 Service_6    140     Text5
5 Service_7    150     Text6

Code:

#Reading Excel 1 
data1 = pd.read_excel(r'C:\Users\XXXX\Excel1.xlsx')
df1 = pd.DataFrame(data1, columns= ['ColumnAlpha'])

#Reading Excel 2
data2 = pd.read_excel(r'C:\Users\XXXX\Excel2.xlsx')
df2 = pd.DataFrame(data2, columns= ['ColumnA','ColumnB','ColumnC'])
    
# Filtering ColumnA containing the text Service_ and sorting
filter_df2 = df2.loc[df2['ColumnA'].str.contains("Service_", case = False)]
clean_df2 = filter_df2.groupby("ColumnA").first().reset_index()
clean_df2.sort_values(by='ColumnB', inplace=True, ascending=[False])

#Comparing ColumnA with ColumnAlpha to filter only matching texts
MatchedData = pd.DataFrame(columns=['ColumnA', 'ColumnB','ColumnC'])
for i in df1.ColumnAlpha:
    match_df = clean_df2[clean_df2.ColumnA.str.contains(i)]
    MatchedData = MatchedData.append(match_df, ignore_index=True)

# Class 2 Services Interfaces created in the last one week
MatchedData["ThirdColumn"] = clean_df2["ColumnC"]

OutputData = pd.DataFrame(MatchedData, columns= ['FirstColumn','SecondColumn','ThirdColumn'])
OutputData.sort_values(by='SecondColumn', inplace=True)

# Printing Output in Excel
with pd.ExcelWriter(r'C:\Users\XXXX\Output.xlsx', engine='xlsxwriter') as writer:      # pylint: disable=abstract-class-instantiated
    # Writing Sheet 1
    clean_df2.to_excel(writer, sheet_name="All entries", index=False)  
    
    #Writing Sheet 2             
    OutputData.to_excel(writer, sheet_name="Filtered entries", index=False)

Expected Dataframe 3 (Sheet 1 in the output)

    ColumnA   ColumnB  ColumnC 
 0 Service_7  150       Text6
 1 Service_6  140       Text5
 2 Service_2  110       Text2
 3 Service_1  100       Text1

Expected Dataframe 4 (Expected Sheet 2 in the output)

   FirstColumn  SecondColumn   ThirdColumn 
 0 Service_6    140            Text5
 1 Service_2    110            Text2
 2 Service_1    100            Text1

Actual Dataframe 4 (Actual Sheet 2 in the output with empty cell C2)

   FirstColumn  SecondColumn   ThirdColumn 
 0 Service_6    140            Text5
 1 Service_2    110            Text2
 2 Service_1    100

What is going wrong here? I am struggling to find why some random values gets omitted when the same value is print in the first sheet of the excel.

Answers or suggestions would be highly appreciated. Thank you very much.

P.S I tried to simplify the question, code and the dataframes as much as possible. My apologies in advance if it still looks complicated.

CodePudding user response：

it seems to me that this problem is related to different column names in MatchedData and OutputData. i tryed init OutputData as

OutputData = pd.DataFrame(MatchedData, columns=['ColumnA', 'ColumnB', 'ColumnC']) And got the expected result.

ps. If pandas is too complex, then for basic tasks you can use the pure xlsxwriter library (without Pandas dataFrames), and control the data using the base Python language like "dict", "set" and etc.

CodePudding user response：

I think you are making things a bit harder than they need to be. Try the following:

# load df1 and df2 (your code should work here)

# =============================================================================
# filter df2 on "Service_", sort on colB desc, and reset index:

df3 = df2.loc[df2['ColumnA'].str.contains('Service_')].copy()
# if you can be certain of asc order original df col, you could here exchange the
# following line for: df3 = df3[::-1]
df3.sort_values(by='ColumnB', ascending=False, inplace=True) 
df3.reset_index(drop=True, inplace=True)

# =============================================================================
# df keeping only matches df3.ColumnA in df1.ColumnAlpha (and reset_index):

df4 = df3.loc[df3['ColumnA'].isin(df1['ColumnAlpha'])].copy()
df4.reset_index(drop=True, inplace=True)
df4.columns = ['FirstColumn','SecondColumn','ThirdColumn']

# =============================================================================

print(df3)

     ColumnA  ColumnB ColumnC
0  Service_7      150   Text6
1  Service_6      140   Text5
2  Service_2      110   Text2
3  Service_1      100   Text1

print(df4)

  FirstColumn  SecondColumn ThirdColumn
0   Service_6           140       Text5
1   Service_2           110       Text2
2   Service_1           100       Text1

# write to Excel (your code should work here)