How to find the missing strings from excel/database using python?-CodePudding

ABCD062208260061
ABCD062208260062
ABCD062208260065
EFGH23UR00000525
EFGH23UR00000526
EFGH23UR00000547
JKLM000002345621
JKLM000002345622
JKLM000002345625

From the above list you can see that there are some of the missing strings in the list i.e., in ABCD series 63 and 64 are missing in EFGH series 527 to 546 are missing and in JKLM series 623 and 624 are missing. I need to find out those missing strings. This is just an example there are 0.5 million of data in the original list.

I have written this below code but for 0.5 million of data it is taking too much time and as well as showing memory error

Output should be like

[ABCD062208260063,ABCD062208260064,EFGH23UR00000527,EFGH23UR00000528,.......]

import pandas as pd
import re

initial_int = 0
miss_list= []
tempstr = ''
tempint = 0

df = pd.read_csv("Book2.csv")

def findMissValue(k,x):
    global tempint
    global tempstr
    global initial_int
    re2 = int(f'{x[1]}') 
    if k==0:
        tempstr=x[0]
    if tempstr!=x[0]:
        tempint=k
        tempstr=x[0]
    if k==tempint:
        initial_int = re2
    else:
        initial_int =1
        if initial_int==re2:
            pass
        else:
            dif = re2-initial_int
            for _ in range(dif):
                tt = f"{x[0]}{initial_int}"
                if tt not in miss_list:
                    miss_list.append(tt)
                    initial_int =1

for i, j in df.iterrows():
    value = j[0]
    reList=re.findall(r"[^\W\d_] |\d ", value)
    if len(reList)>2:
        reListnew = []
        ll = ''
        for j in range(len(reList)-1):
            ll = ll   reList[j]
        reListnew.append(ll)
        reListnew.append(reList[-1])
        findMissValue(i,reListnew)
    else:
       findMissValue(i,reList)

print((miss_list))
nn = pd.DataFrame(miss_list)
nn.to_csv("kkkk.csv")

CodePudding user response：

Please use below code to find missing values in excel fields.

string_list=['ABCD062208260061','ABCD062208260062','ABCD062208260065','EFGH23UR00000525','EFGH23UR00000526','EFGH23UR00000547','JKLM000002345621','JKLM000002345622','JKLM000002345625']
    split_list={}
    prefix_len=8
    for item in string_list:
        prefix=item[:prefix_len]
        number=item[prefix_len 1:]
        if prefix not in split_list.keys():
            split_list[prefix]=[]
        split_list[prefix].append(int(number))
    print(split_list)
    for prefix in split_list.keys():
        split_list[prefix].sort()
        missing_value = set(range(split_list[prefix][0], split_list[prefix][-1]   1)) - set(split_list[prefix])
        print("below is missing value for prefix=" prefix)
        print(missing_value)

In your case , below will be the output:

below is missing value for prefix=ABCD0622
{8260064, 8260063}
below is missing value for prefix=EFGH23UR
{527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546}
below is missing value for prefix=JKLM0000
{2345624, 2345623}

CodePudding user response：

Here's a different approach.

Given that the input file only contains one column, we don't need to use any CSV modules to read it.

Identify the last sequence of digits in each line of the file. Everything that precedes that is the key. Build a dictionary using those keys and maintain a note of the lowest and highest values seen so far. You'll also need the length of the numeric string for output formatting. Maintain a set of observed values.

Let's say the input file has this content:

ABCD062208260061
ABCD062208260062
ABCD062208260065
EFGH23UR00000525
EFGH23UR00000526
EFGH23UR00000547
JKLM000002345621
JKLM000002345622
JKLM000002345625

Then assuming the file is sorted:

import re

BOOK2 = 'Book2.csv'
OUTPUT = 'output.csv'


def get_key_and_value(s):
    v = re.findall(r'(\d )', line)[-1]
    l = len(v)
    k = s[:len(s)-l]
    return k, int(v), l

def dump(d, output):
    for k, (lo, hi, _len, _set) in d.items():
        for n in range(lo, hi 1):
            if n not in _set:
                print(f'{k}{n:0{_len}d}', file=output)

db = {}

with open(BOOK2) as book, open(OUTPUT, 'w') as output:
    for line in map(str.strip, book):
        k, v, _len = get_key_and_value(line)
        if (t := db.get(k)) is None:
            dump(db, output)
            db = {k: (v, v, _len, {v})}
        else:
            lo, hi, _len, _set = t
            if v < lo:
                lo = v
            if v > hi:
                hi = v
            _set.add(v)
            db[k] = lo, hi, _len, _set

    dump(db, output)

Which leads to this in the output file:

ABCD062208260063
ABCD062208260064
EFGH23UR00000527
EFGH23UR00000528
EFGH23UR00000529
EFGH23UR00000530
EFGH23UR00000531
EFGH23UR00000532
EFGH23UR00000533
EFGH23UR00000534
EFGH23UR00000535
EFGH23UR00000536
EFGH23UR00000537
EFGH23UR00000538
EFGH23UR00000539
EFGH23UR00000540
EFGH23UR00000541
EFGH23UR00000542
EFGH23UR00000543
EFGH23UR00000544
EFGH23UR00000545
EFGH23UR00000546
JKLM000002345623
JKLM000002345624