So I have this big .csv in my work that looks something like this:
Name| Adress| Email| Paid Value
John| x street | [email protected]| 0|
Chris| c street | [email protected]| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|
As you can see, the .csv is seperated by pipes and the email of the last two people have pipes in it, causing formating problems. There are only 2 customers with this problem but these fellas will have more and more entries every month and we have to manually find them in the csv and change the email by hand . It is a very boring and time consuming process because the file is that big.
We use python to deal with data, I researched a bit and couldn't find anything to help me with it, any ideas?
Edit: So what I want is a way to change this email adresses automatically through code (like RebeccaFML|@dmail.com -> [email protected]). It doenst need to be pandas or anything, I am accepting ideas of any sort. The main thing is I only know how to replace once I read the file in python, but since these registers have the pipes in it, they dont read properly.
Ty in advance
CodePudding user response:
Create a generator for custom rows Here it supposed to set email column at 3rd place, but u can adapt it
import pandas
def rows(path: str, sep: str = '|'):
with open(path) as f:
header = [*f.readline().split(sep), None]
for row in f:
row = row.rsplit('\n', 1)[0].split(sep)
if len(row) > len(header):
yield [*row[:2], ''.join((row[2], row[3])), *row[4:]]
else:
yield row
pandas.DataFrame(rows('data.csv'))
CodePudding user response:
You can pass regex as separator in read_csv
. \|\s*(?!\@)
will split on pipes (possibly followed by spaces), but exclude pipes followed by an at sign. You can subsequently remove the remaining pipes with replace
:
import pandas as pd
import io
data = '''Name| Adress| Email| Paid Value
John| x street | [email protected]| 0|
Chris| c street | [email protected]| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|'''
df = pd.read_csv(io.StringIO(data), sep = r'\|\s*(?!\@)', engine='python',index_col=False,usecols=range(4)).replace('\|','', regex=True)
Output:
Name | Adress | Paid Value | ||
---|---|---|---|---|
0 | John | x street | [email protected] | 0 |
1 | Chris | c street | [email protected] | 100 |
2 | Rebecca | y street | [email protected] | 177 |
3 | Bozo | z street | [email protected] | 976 |