Get a sequence from a string using regular expressions #Python #Regex-CodePudding

I would like your help using #Python.

I have this dataset:

E   1   1999-02-28  b,g,f    jjj:12,bbb:3,ddd:9,ggg:8,hhh:2
A   2   1999-10-28  a,f,c,d  ccc:2,ddd:0,aaa:3,hhh:9

I need to get the sequences b,g,f and a,f,c,d in a list. I tried many combinations of the pattern [a-z],[a-z], but every time the last term is skipped, and I do not know how to generalize it to capture the whole sequence.

The output should look like this:

[b,g,f]
[a,f,c,d]

The dataset comes from a CSV file, which I'm reading like this:

with open("data.csv", "r") as file:
    lines = file.readlines()

Then using a for loop to read the lines:

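import re

list_sequence = []
for i in lines:
    # '???' is the placeholder for the pattern being asked about
    a = re.findall(pattern='???', string=i)
    list_sequence.append(a)  # was append(b), but b is never defined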

The question marks are where I need the pattern.

CodePudding user response：

You can use

(?<!\S)[a-z](?:,[a-z])*(?!\S)

Details:

(?<!\S) - left whitespace boundary
[a-z](?:,[a-z])* - a lowercase ASCII letter and then zero or more sequences of a comma and a lowercase ASCII letter
(?!\S) - right whitespace boundary.
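
As a minimal sketch of how this pattern could be plugged into the loop from the question: the helper name extract_sequences and the in-memory sample_lines list are assumptions made for illustration (they stand in for the lines read from data.csv), and the final split on commas is one possible way to turn each match into a list.

```python
import re

# Pattern from the answer above: comma-separated single lowercase letters,
# bounded by whitespace (or the start/end of the string) on both sides.
SEQUENCE_RE = re.compile(r"(?<!\S)[a-z](?:,[a-z])*(?!\S)")


def extract_sequences(lines):
    """Return one list of letters for each line that contains a matching field."""
    sequences = []
    for line in lines:
        match = SEQUENCE_RE.search(line)
        if match:  # lines without such a field are skipped
            sequences.append(match.group(0).split(","))
    return sequences


# Sample rows copied from the question (hypothetical stand-in for data.csv).
sample_lines = [
    "E   1   1999-02-28  b,g,f    jjj:12,bbb:3,ddd:9,ggg:8,hhh:2\n",
    "A   2   1999-10-28  a,f,c,d  ccc:2,ddd:0,aaa:3,hhh:9\n",
]

print(extract_sequences(sample_lines))
# [['b', 'g', 'f'], ['a', 'f', 'c', 'd']]

# The two-letter pattern from the question stops after a single pair,
# which is why the last term of b,g,f was being dropped:
print(re.findall(r"[a-z],[a-z]", "b,g,f"))
# ['b,g']
```

The whitespace boundaries are what keep the pattern from matching inside the last column (jjj:12,bbb:3,...), since every letter there is adjacent to another non-whitespace character.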

CodePudding user response：

You can try the approach below: split each line into whitespace-separated fields, then split the fourth field on commas.

with open('in.txt') as f:
  data = []
  for line in f:
    parts = line.split()
    data.append(parts[3].split(','))
print(data)

output

[['b', 'g', 'f'], ['a', 'f', 'c', 'd']]