How to get the nearest match in a CSV file in Python

I want to get the nearest match in my large .csv file in Python. My (shortened) .csv file is:

0,4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0
0,4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0
1,4,5,0,40,1727,0,128,6,29216,62281,22,123196295,3338477204,5,0,26,513,30738,0
0,4,5,0,116,24108,0,64,6,23178,39635,22,21452647,3233437508,8,0,4126,644,61163,0
0,4,5,0,724,32046,0,64,6,14632,38655,22,1452688218,1828171762,8,0,4126,343,31853,0
0,4,5,0,76,26502,0,128,6,4405,50266,22,1776918274,3172205875,5,0,4126,512,9381,0
1,4,5,0,40,7662,0,64,6,39665,22,62202,3176642698,3972914889,5,0,26,501,63331,0
1,4,5,0,52,939,0,128,6,29992,62206,22,1466629610,0,8,0,44,64240,43460,0
0,4,5,16,76,10076,0,64,6,37199,22,50268,4016221794,718292575,5,0,4126,501,310,0
0,4,5,0,40,26722,0,128,6,4221,50270,22,38340335,3852724687,5,0,26,510,36549,0
0,4,5,0,76,26631,0,128,6,4276,50266,22,1776920362,3172222235,5,0,4126,511,61692,0
0,4,5,16,148,38558,0,64,6,8680,22,37221,2019795091,3598991383,8,0,4126,501,9098,0
0,4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0
0,4,5,16,76,10249,0,64,6,37026,22,50266,3172221011,1776919966,5,0,4126,501,31557,0
0,4,5,16,212,38490,0,64,6,8684,22,37221,2019776067,3598991175,8,0,4126,501,56063,0
0,4,5,0,60,0,0,64,6,47342,22,44751,2722242689,3606442876,10,0,4426,65160,29042,0
0,4,5,16,76,10234,0,64,6,37041,22,50266,3172220319,1776919498,5,0,4126,501,49854,0
1,4,5,0,1016,1737,0,128,6,28230,62273,22,3387237183,3449598142,5,0,4126,513,49536,0
1,4,5,0,40,20630,0,64,6,26697,22,62288,4040909519,95375909,5,0,26,501,36104,0
0,4,5,16,180,22591,0,64,6,24615,22,39635,3233437764,21452775,8,0,4126,501,28548,0
0,4,5,0,52,31654,0,64,6,15696,47873,22,3476257438,205382502,8,0,26,368,59804,0
1,4,5,0,320,20922,0,64,6,26125,22,62195,2187234888,2519273239,5,0,4126,501,52263,0
0,4,5,0,1132,22526,0,64,6,23744,22,39635,3233417124,21450447,8,0,4126,509,12391,0
1,4,5,0,52,0,0,64,6,47315,22,62282,3209938138,2722777338,8,0,4426,64240,36683,0
0,4,5,0,52,3091,0,64,6,44259,22,38655,1828172842,1452688914,8,0,26,504,7425,0
0,4,5,16,132,10184,0,64,6,37035,22,50266,3172212167,1776918310,5,0,4126,501,44260,0
0,4,5,16,256,10167,0,64,6,36928,22,50266,3172210503,1776918310,5,0,4126,501,19165,0
1,4,5,0,120,2043,0,128,6,28820,62294,22,644393448,2960970388,5,0,4126,512,36939,0
0,4,5,16,196,38575,0,64,6,8615,22,37221,2019796627,3598991543,8,0,4126,501,29587,0
0,4,5,16,148,22599,0,64,6,24639,22,39635,3233438532,21452967,8,0,4126,501,41316,0
1,4,5,0,88,1733,0,128,6,29162,62267,22,872073945,3114048214,5,0,4126,508,23918,0

I have written a program, but it isn't finished and I don't know how to complete it. Do I have to use a different program?

with open("<dir>", "r") as f:
    lines = f.readlines()
len_ = len(lines)

# The row for which I want to find the nearest data in the .csv file.
string = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0"

list_ = []

# The range starts at 1, so the first line of the file is skipped.
for i in range(1, len_):
    item = lines[i].strip()
    item2 = item[2:]  # drop the first column and its comma
    list_.append(item2)

for item in list_:
    pass  # not implemented yet: compare item with string

Algorithm: scan each row from left to right and pick the row with the most consecutive matches to the search data.
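
One way to read the rule above is as a longest-matching-prefix search: count how many fields agree with the search data, starting from the left and stopping at the first mismatch, then keep the row with the highest count. The following is only a sketch under that assumption ("most sequential matches" could also mean the longest run anywhere in the row). The helper names leading_matches and best_prefix_match are made up for illustration, and the three rows are copied from the sample data with the first column removed.

```python
def leading_matches(row, target):
    """Count the fields that match from the left, stopping at the first mismatch."""
    count = 0
    for a, b in zip(row, target):
        if a != b:
            break
        count += 1
    return count


def best_prefix_match(rows, target):
    """Return (index, match_count) of the row with the longest matching prefix.

    On a tie, the first such row wins.
    """
    counts = [leading_matches(row, target) for row in rows]
    best = max(range(len(rows)), key=counts.__getitem__)
    return best, counts[best]


# Three rows from the sample data, first column already removed (assumption:
# fields are compared as strings, exactly as they appear in the file).
rows = [
    "4,5,0,132,24055,0,64,6,23215,39635,22,21451751,3233419908,8,0,4126,368,15087,0".split(","),
    "4,5,16,52,22607,0,64,6,24727,22,39635,3233439332,21453192,8,0,26,501,28207,0".split(","),
    "4,5,0,52,24058,0,64,6,23292,39635,22,21452135,3233420036,8,0,26,368,38558,0".split(","),
]
target = "4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0".split(",")

index, count = best_prefix_match(rows, target)
print(index, count)  # 2 4
```

Here the third row wins because its first four fields (4, 5, 0, 52) match the search data before the first difference.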

CodePudding user response:

It seems you are dealing with a machine learning problem: you have a dataset and a point, and you want to find the point's nearest neighbor. I assume you want the point of the dataset that has the shortest Euclidean distance (in 19 dimensions) to the given point.
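
To make "shortest Euclidean distance" concrete, here is a minimal standard-library sketch of a brute-force nearest-neighbor search. It is not the scikit-learn approach used in the rest of this answer; the function name nearest_row and the toy 3-dimensional points are invented for illustration, and math.dist requires Python 3.8 or later.

```python
import math


def nearest_row(rows, query):
    """Return (index, distance) of the row with the smallest Euclidean distance to query."""
    distances = [math.dist(row, query) for row in rows]
    best = min(range(len(rows)), key=distances.__getitem__)
    return best, distances[best]


# Toy points (hypothetical data, not taken from the .csv file).
rows = [(0, 0, 0), (3, 4, 0), (10, 10, 10)]
query = (3, 3, 0)

index, distance = nearest_row(rows, query)
print(index, distance)  # 1 1.0
```

The same idea applies to the 19 remaining columns of the file; a library implementation mainly adds speed and convenience on large data.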

I would use the pandas and scikit-learn packages with the NearestNeighbors algorithm. Import the packages:

from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

Load file.csv as a pandas DataFrame (with generic column names):

df = pd.read_csv('file.csv', index_col=False, names=np.arange(20))

Since you want the value in the first column as the result, I move that column to a pandas Series called "first_column" and drop it from the "df" DataFrame:

first_column = df[0]
df.drop(columns=[0], inplace=True)

What you called "string" I call "y", and I define it as a NumPy array:

y = np.array([[4,5,0,52,32345,0,64,6,15005,37221,22,3598991799,2019801315,8,0,26,691,17176,0]])

Now let's fit the NearestNeighbors model:

nnb = NearestNeighbors(n_neighbors=1).fit(df)

Then compute which point in the dataset is closest to the given point y:

distances, indices = nnb.kneighbors(y, n_neighbors=1)
print(indices)
[[13]]

So the nearest point has index 13 in the DataFrame (the 14th row, since indexing starts at 0). Let's print the value of first_column at index 13:

print(first_column.loc[13])
0