I'm teaching myself how to build data pipelines using some data a professor at my alma mater loaned me.
Everything is contained in Excel workbooks, and right now I'm using:
df = pd.read_excel(filename, sheet_name=sheet_titles, skiprows=2, usecols=colstring)
to get just the sheets and the columns from each sheet I need.
This works just fine, but I want to get the data into CSV format so I can import it into my Postgres DB. The trouble is that 21 of these columns contain the text portion of emails that have been copy-pasted over. This, of course, brings in a lot of invalid characters that I now need to remove. The main problem is the line breaks.
Ex.
Thank you for your enquiry.
Please can you provide us with more details.
How many shareholders and directors of the company are your proposing?
Could you expand on and lay out details of what the activities of the company are?
What jurisdictions does the company operate in?
Why have you chosen XXXXXX as a jurisdiction?
Have you sought any tax or legal advice for setting up the operation in XXXXXX?
You discuss making financial transactions. Could you provide details of
what these transactions would be for and in what jurisdictions.?
Thank you and best wishes,
XXXXXX
Firstname Lastname
Director
MyCompany Group Ltd
After working on this for several hours, I am still stuck. This is my current (not working) code:
def rem_invalid_chars(input):
    valid_chars = ['.', '"', ',', ' ']
    s = ''.join(item for item in input if item.isalnum() or item in valid_chars)
    return s

for name, sheet in df.items():
    for column in sheet.columns[5:26]:
        sheet[column].map(rem_invalid_chars, na_action='ignore')
        print(sheet[column])
When sheet[column] prints, it still prints the values with the invalid characters. :(
The function itself seems to work, though: when I add a print statement,
def rem_invalid_chars(input):
    valid_chars = ['.', '"', ',', ' ']
    s = ''.join(item for item in input if item.isalnum() or item in valid_chars)
    print(s)
    return s
print(s) gives the values without invalid characters.
Any help with this would be greatly appreciated!!
CodePudding user response:
map returns a Series with the mapped values, but it doesn't change the original Series.
So you can instead assign the modified Series to the old column:
sheet[column] = sheet[column].map(rem_invalid_chars, na_action='ignore')
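Here is a minimal standalone sketch of the difference (the column name and sample strings are made up for illustration):

```python
import pandas as pd

def rem_invalid_chars(text):
    valid_chars = ['.', '"', ',', ' ']
    return ''.join(c for c in text if c.isalnum() or c in valid_chars)

sheet = pd.DataFrame({'email': ['Hello!\nWorld?', 'Thanks,\r\nJane']})

# map alone returns a new Series; the original column is left untouched
cleaned = sheet['email'].map(rem_invalid_chars, na_action='ignore')
print(sheet['email'].tolist())  # still ['Hello!\nWorld?', 'Thanks,\r\nJane']

# assigning the result back is what actually updates the DataFrame
sheet['email'] = cleaned
print(sheet['email'].tolist())  # ['HelloWorld', 'Thanks,Jane']
```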
CodePudding user response:
Like Ignatius pointed out, map doesn't assign values back to the object it was called on. This helped my thinking, and I realized the solution was simply to build an intermediate list of DataFrames by concatenating the cleaned Series objects, like so:
#initialize
masterlist = []
#iter through sheets (DataFrames in dict)
for name, sheet in df.items():
    #assign some columns for tracking file of origin
    sheet['Round'] = filename[:-5]  #slice to ignore file extension
    sheet['Original_Sheet'] = name
    #initialize a list to contain cleaned versions of this sheet's columns
    sheet_list = []
    #clean up invalid characters column by column
    for i, column in enumerate(sheet.columns):
        if i > 4 and i < 26:  #or whatever range of columns you need to clean text from
            sheet_list.append(sheet[column].map(rem_invalid_chars, na_action='ignore'))
        else:
            sheet_list.append(sheet[column])
    #append a DataFrame built by concatenating the columns we just cleaned
    masterlist.append(pd.concat(sheet_list, axis=1))
#get full table by concatenating masterlist of cleaned sheets
full_table = pd.concat(masterlist)
#export to csv
full_table.to_csv('trythis123.csv')
return
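As a side note, the per-character loop can also be expressed with pandas string methods, which skip NaN cells automatically. A sketch assuming the same "keep alphanumerics plus . " , and space" rule (the column name here is hypothetical):

```python
import pandas as pd

sheet = pd.DataFrame({'Email_Text': ['Hi!\nSee attached: budget (v2)', None]})

# regex drops every character that is not a word character, '.', '"', ',' or space
# (note: \w also keeps underscores, a slight difference from str.isalnum)
cleaned = sheet['Email_Text'].str.replace(r'[^\w.," ]', '', regex=True)
print(cleaned.tolist())  # ['HiSee attached budget v2', nan]
```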