How do I parse a text file and get a count of unique values?

I have a file:

If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.PSTAT like '%K%'
If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.MATKL = '99999999'

I want to parse it by adding each word after the "." to a list (MTART, PSTAT, MATKL), skipping any word that is already in the list.

so the list would be:

list = ['MTART', 'PSTAT', 'MATKL']

I am not sure how to go about this.

CodePudding user response：

Python:

This can be accomplished with regular expressions, via the re library. See the Python documentation for re.findall() for details.

The code iterates over the lines of the data file, searches each line for the pattern, and appends the matches to an output list. Duplicates are then dropped with set(), since a set holds only unique values. Note that a set is unordered, so the order of the final list is arbitrary.

Pattern explanation: '\.([A-Z]+)'

Find a full stop (.)
Search for and capture one or more upper case characters, and stop capturing when the first non-uppercase character is found.

Example code:

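import re

# Match a full stop followed by one or more uppercase letters; capture the letters.
rexp = re.compile(r'\.([A-Z]+)')
found = []

with open('./mara.csv') as f:
    for line in f:
        # findall() returns only the captured group for each match.
        found.extend(rexp.findall(line))

# A set keeps one copy of each value; its order is arbitrary.
print(list(set(found)))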

Output:

['MTART', 'MATKL', 'PSTAT']
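
If the order of first appearance matters, as in the list shown in the question, a set will not preserve it. Below is a minimal sketch of an order-preserving variant. It assumes the same kind of pattern and relies on dictionaries keeping insertion order (Python 3.7+). The names FIELD_RE, unique_fields and sample are illustrative, and the sample lines are copied from the question rather than read from a file:

```python
import re

# A full stop followed by one or more uppercase letters; the letters are captured.
FIELD_RE = re.compile(r'\.([A-Z]+)')

def unique_fields(lines):
    """Return the captured names in order of first appearance, without duplicates."""
    seen = {}
    for line in lines:
        for name in FIELD_RE.findall(line):
            seen.setdefault(name, None)  # no effect if the name is already present
    return list(seen)

sample = [
    "If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.PSTAT like '%K%'",
    "If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.MATKL = '99999999'",
]
print(unique_fields(sample))  # ['MTART', 'PSTAT', 'MATKL']
```

To use it on a file, pass the open file object instead of the list, since iterating over a file yields its lines.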

GNU:

On the other hand, if you'd like to use GNU tools instead, this can be accomplished using:

grep -Eo "\.([A-Z]+)" mara.csv | awk -F. '{print $2}' | sort | uniq

Output:

MATKL
MTART
PSTAT

CodePudding user response：

Given:

txt="""\
If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.PSTAT like '%K%'
If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.MATKL = '99999999'"""

Just use a set comprehension:

>>> {w.partition('.')[2] for w in txt.split() if '.' in w}
{'MATKL', 'MTART', 'PSTAT'}
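
Since the title also asks for a count, here is a small sketch along the same lines. It assumes the same splitting approach: len() of the set gives the number of distinct names, and collections.Counter gives how often each one occurs. The variable names are illustrative:

```python
from collections import Counter

txt = """\
If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.PSTAT like '%K%'
If MARA.MTART in ('ZPLW', 'ZFTW'), then MARA.MATKL = '99999999'"""

# For every whitespace-separated token containing a ".", keep the text after the first ".".
names = [w.partition('.')[2] for w in txt.split() if '.' in w]

unique_count = len(set(names))  # number of distinct names
occurrences = Counter(names)    # how many times each name appears

print(unique_count)             # 3
print(occurrences['MTART'])     # 2
```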

CodePudding user response：

import re

filename = 'yourcsv.csv'
with open(filename) as f:
    ip = f.read()

# Capture the text between each "." and the next space, then de-duplicate with a set.
test = list(set(re.findall(r'.\.(.*?) ', ip)))
print(test)

Output: ['MATKL', 'MTART', 'PSTAT']

CodePudding user response：

{m,g}awk '{ gsub(_,     "", $!(NF = NF))
             sub("..",    "&"($__)"|",_)
            gsub("[|]+",          "|",_) } END {
            gsub("^\\^[|]|[|][$]$","",_)
 
 print _ }' FS='(^| )[^.]+MARA[.]| [^.]*$' OFS='|' \_='^|$'

MATKL|MTART|PSTAT