Home > OS >  Python Pandas - Email column has values with the delimiter on it
Python Pandas - Email column has values with the delimiter on it

Time:10-09

So I have this big .csv in my work that looks something like this:

Name| Adress| Email| Paid Value    
John| x street | [email protected]| 0|
Chris| c street | [email protected]| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|

As you can see, the .csv is seperated by pipes and the email of the last two people have pipes in it, causing formating problems. There are only 2 customers with this problem but these fellas will have more and more entries every month and we have to manually find them in the csv and change the email by hand . It is a very boring and time consuming process because the file is that big.

We use python to deal with data, I researched a bit and couldn't find anything to help me with it, any ideas?

Edit: So what I want is a way to change this email adresses automatically through code (like RebeccaFML|@dmail.com -> [email protected]). It doenst need to be pandas or anything, I am accepting ideas of any sort. The main thing is I only know how to replace once I read the file in python, but since these registers have the pipes in it, they dont read properly.

Ty in advance

CodePudding user response:

Create a generator for custom rows Here it supposed to set email column at 3rd place, but u can adapt it

import pandas
def rows(path: str, sep: str = '|'):
    with open(path) as f:
        header = [*f.readline().split(sep), None]
        for row in f:
            row = row.rsplit('\n', 1)[0].split(sep)
            if len(row) > len(header):
                yield [*row[:2], ''.join((row[2], row[3])), *row[4:]]
            else:
                yield row
pandas.DataFrame(rows('data.csv'))

CodePudding user response:

You can pass regex as separator in read_csv. \|\s*(?!\@) will split on pipes (possibly followed by spaces), but exclude pipes followed by an at sign. You can subsequently remove the remaining pipes with replace:

import pandas as pd
import io

data = '''Name| Adress| Email| Paid Value    
John| x street | [email protected]| 0|
Chris| c street | [email protected]| 100|
Rebecca| y street| RebeccaFML|@dmail.com|177|
Bozo | z street| BozoSMH|@yahow.com|976|'''

df = pd.read_csv(io.StringIO(data), sep = r'\|\s*(?!\@)', engine='python',index_col=False,usecols=range(4)).replace('\|','', regex=True)

Output:

Name Adress Email Paid Value
0 John x street [email protected] 0
1 Chris c street [email protected] 100
2 Rebecca y street [email protected] 177
3 Bozo z street [email protected] 976
  • Related