I have a csv file that has inconsistent spacing after the comma, like this:
534323, 93495443,34234234, 3523423423, 2342342,236555, 6564354344
I have written a function that tries to read in the file and makes the spacing consistent, but it doesn't appear to update anything. After opening the new file created, there is no difference from the original. The function I've written is:
def ensure_consistent_spacing_in_csv(dirpath, original_name, new_name):
    with open(dirpath + original_name, "r") as f:
        data = f.readlines()
    for item in data:
        if "," in data:
            comma_index = item.index(",")
            if item[comma_index + 1] != " ":
                item = item.replace(",", ", ")
    with open(dirpath + new_name, "w") as f:
        f.writelines(data)
Where am I going wrong?
I have looked at the answer to the question here, but I cannot use that method as I need the delimiter to be ", ", which is two characters and hence not allowed. I also tried to follow the method in the sed answer to the question here using subprocess.call, but that also failed, and since I don't know bash well I'm hesitant to go that route and would like to use a pure Python method.
Thank you!
CodePudding user response:
Here is how I was able to normalize the spacing, given a string from your example.
NOTE: I am assuming the content of the file isn't large enough to exceed the available memory, since you read it into a list in your code.
NOTE: using regular expressions may not always (read: almost never) be the most efficient way to solve a problem, but it gets the job done.
regex = r"(?<=\d)\s*,\s*(?=\d)" # please see the UPD:
test_str = "534323, 93495443,34234234, 3523423423, 2342342,236555, 6564354344"
subst = ", "
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
print(result)
will produce
534323, 93495443, 34234234, 3523423423, 2342342, 236555, 6564354344
and for a file with the following content:
1, 2, 3, 4,5,6
1,2,3,4, 5, 6
1, 2,3,4,5,6
I ran
with open('test.csv') as f:
data = f.read()
regex = r"(?<=\d)\s*,\s*(?=\d)" # please see the UPD:
subst = ", "
result = re.sub(regex, subst, data)
print(result)
and got this result:
1, 2, 3, 4, 5, 6
1, 2, 3, 4, 5, 6
1, 2, 3, 4, 5, 6
Alternatively, you could use the csv module to read the rows and strip() each element of every row.
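A minimal sketch of that csv-based approach might look like this (the output file name cleaned.csv is just a placeholder):

import csv

# Read with the standard single-character "," delimiter, strip stray spaces
# from each field, then write the fields back out separated by ", ".
with open("test.csv", newline="") as src, open("cleaned.csv", "w") as dst:
    for row in csv.reader(src):
        dst.write(", ".join(field.strip() for field in row) + "\n")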
UPD: The regex could be simplified to
regex = r"\s*,\s*"
CodePudding user response:
The original code has a couple of bugs:
- The if "," in data condition never evaluates to true. data is a list, where each item in the list is a string representing one entire line of the file. No single line in the file is equal to ",", so that condition never evaluates to true. To fix it, use if "," in item, so that it checks whether each line contains a comma.
- There's also a second problem: item.index(",") returns only the position of the first comma, so if there's inconsistent spacing twice in one line, the algorithm does not catch it (a corrected sketch is shown below).
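For reference, here is a minimal sketch of the original function with both fixes applied (assuming there is at most one space after any comma, as in the example data; note that the corrected string also has to be stored back into the list, since reassigning the loop variable alone does not change data):

def ensure_consistent_spacing_in_csv(dirpath, original_name, new_name):
    with open(dirpath + original_name, "r") as f:
        data = f.readlines()
    for i, item in enumerate(data):
        if "," in item:  # check each line, not the whole list
            # drop any existing space after a comma, then add exactly one back,
            # which normalizes every comma rather than just the first one
            data[i] = item.replace(", ", ",").replace(",", ", ")
    with open(dirpath + new_name, "w") as f:
        f.writelines(data)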
A simple solution that doesn't require regular expressions, sed, or indexing into each line and checking it character by character is:
with open(dirpath + orig_filename, "r") as f:
    for line in f:
        new_line = line.replace(" ", "").replace(",", ", ")
        with open(dirpath + cleaned_filename, "a") as cleaned_data:
            cleaned_data.writelines(new_line)
What this is doing is:
- for line in f reads each line of the file.
- line.replace(" ", "").replace(",", ", ") first removes all spaces entirely from the line (thanks to @megakarg for the suggestion), and then makes sure there's a single space after each comma to meet the spec.
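For instance, on one line from the example file:

line = "1,2,3,4, 5, 6"
print(line.replace(" ", "").replace(",", ", "))
# prints: 1, 2, 3, 4, 5, 6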