Given a data set that has commas as part of the text, is there a good/easy way to convert them so I can parse the rest of the data using the 'real' commas? The commas I want to ignore/translate are always inside parentheses.
import pandas as pd
# Create Series
s = pd.Series(['one,two,ten','first,second,third(twenty,thirty,forty),last','ten,eleven,twelve'],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
# Create DataFrame df from two series
df = pd.DataFrame({'first':s,'second':k})
My thought is that for each row in column first I need to check for a "(" and then, if there is a ",", convert it to "-". When I get to the ")" I stop the translation. In the end I will have third(twenty-thirty-forty).
Is there a char-by-char parser that can be triggered by a "("?
expected output:
#Create Series
s = pd.Series(['one,two,ten','first,second,third(twenty-thirty-forty),last','ten,eleven,twelve'],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
df = pd.DataFrame({'first':s,'second':k})
CodePudding user response:
Let us try str.replace with a replacement lambda function:
repl = lambda g: g.group().replace(',', '-')
df['first'] = df['first'].str.replace(r'\((.*?)\)', repl, regex=True)
print(df)
                                            first  second
buz                                   one,two,ten       y
bas  first,second,third(twenty-thirty-forty),last       n
bur                             ten,eleven,twelve       o
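For readers who want to see what the callback does outside pandas, here is a minimal sketch of the same idea using only the standard `re` module. The helper name `dash_commas_in_parens` is made up for illustration, and the sketch assumes parentheses are balanced and not nested.

```python
import re

def dash_commas_in_parens(text):
    """Replace commas with '-' inside each (...) group; leave other commas alone.

    Hypothetical helper for illustration. Assumes balanced, non-nested parentheses.
    """
    # \(.*?\) lazily matches from a '(' to the nearest following ')'.
    # The callable receives each match and returns its replacement text.
    return re.sub(r'\(.*?\)', lambda m: m.group().replace(',', '-'), text)

print(dash_commas_in_parens('first,second,third(twenty,thirty,forty),last'))
```

The pandas call above does the same thing row by row: `str.replace` hands each regex match to the lambda and substitutes whatever it returns.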
CodePudding user response:
You can create a character-by-character parser and apply it to the column:
def replace_comma(x):
    # copy the string into a list of characters that can be modified in place
    chars = list(x)
    # flag that is True while we are inside parentheses
    in_parenthesis = False
    # iterate through the characters
    for count, character in enumerate(chars):
        if character == '(':
            in_parenthesis = True
        elif character == ')':
            in_parenthesis = False
        elif character == ',' and in_parenthesis:
            # inside parentheses: replace the comma with '-'
            chars[count] = '-'
    # join the characters back into a string
    return ''.join(chars)
Use pandas apply on the column, assigning the result back so the rest of the DataFrame is kept:
df['first'] = df['first'].apply(replace_comma)
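A single boolean flag treats the first ")" as the end of the group. The question does not mention nested parentheses, but if they can occur, a depth counter is one possible variation. The sketch below is an assumption-laden illustration, not part of the original answer; the function name `replace_comma_nested` is hypothetical.

```python
def replace_comma_nested(text, replacement='-'):
    """Replace commas at any parenthesis depth > 0.

    Hypothetical variant for illustration; assumes parentheses are balanced.
    """
    depth = 0          # how many '(' are currently open
    out = []
    for character in text:
        if character == '(':
            depth += 1
        elif character == ')':
            # never go below zero if a stray ')' appears
            depth = max(depth - 1, 0)
        if character == ',' and depth > 0:
            out.append(replacement)
        else:
            out.append(character)
    return ''.join(out)

print(replace_comma_nested('a(b,(c,d),e),f'))
```

It can be passed to `apply` on a column in the same way as any other function that takes and returns a string.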
CodePudding user response:
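Using str.replace with a regex lookahead (pass regex=True, since recent pandas versions treat the pattern as a literal string by default):
df['first'].str.replace(r"(,(?=[^()]*\)))", '-', regex=True)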
Output:
buz                                     one,two,ten
bas    first,second,third(twenty-thirty-forty),last
bur                               ten,eleven,twelve
Name: first, dtype: object
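If the end goal is only to split on the 'real' commas, the lookahead can be negated so the split happens directly, without rewriting the text first. This is a sketch under the same assumption of balanced, non-nested parentheses; `split_outside_parens` is a made-up name.

```python
import re

# A comma is 'real' when it is NOT followed by a run of non-parenthesis
# characters ending in ')', i.e. when it is not inside an open (...) group.
REAL_COMMA = re.compile(r',(?![^()]*\))')

def split_outside_parens(text):
    """Split on commas that lie outside parentheses (hypothetical helper)."""
    return REAL_COMMA.split(text)

print(split_outside_parens('first,second,third(twenty,thirty,forty),last'))
```

The commas inside the parentheses stay untouched in the resulting pieces, so no second pass is needed to restore them.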