I am trying to compare strings from a text file with the elements of a dataframe column. The strings in the file look like this:
0100000000
0001000000
I would like to check every line of my file against the position column of the dataframe and, if the line matches, print the line together with the corresponding value from the vector column. Something like this:
0100000000 01
0001000000 01
I have this code so far. It's basic, and I don't know how to continue:
import pandas as pd

data_f = pd.DataFrame(
    {'position':
        {0: '1000000000',
         1: '0100000000',
         2: '0010000000',
         3: '0001000000',
         4: '0000100000',
         5: '0000010000'},
     'vector': {0: '10', 1: '01', 2: '10', 3: '01', 4: '01', 5: '01'}})

with open("/test_2vec/example_vec61.txt", "r") as f1:
    for vec in f1:
        print(vec)
CodePudding user response:
Approach #1 (iterative)
Iterate over the file's lines and check whether each line occurs in the position column of the dataframe df (data_f in the question):
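with open('yourfile.txt') as fin:
    for line in fin:
        # Drop the trailing newline before comparing
        line = line.strip()
        # Values of 'vector' for the rows whose 'position' equals the line
        vec = df.loc[df['position'].eq(line), 'vector'].values
        if vec.size:
            print(line, vec[0])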
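For a version that runs without the file, here is a minimal sketch on the sample data from the question. The io.StringIO object stands in for the text file, the extra line 1111111111 is made up to show a non-match, and the helper name iter_matches is hypothetical.

```python
import io

import pandas as pd

# Sample dataframe copied from the question.
df = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Stand-in for the text file (assumption: one string per line).
# The last line is invented and has no match in the dataframe.
fin = io.StringIO("0100000000\n0001000000\n1111111111\n")


def iter_matches(lines, frame):
    """Yield (line, vector) for each line found in frame['position']."""
    for line in lines:
        line = line.strip()
        vec = frame.loc[frame['position'].eq(line), 'vector'].values
        if vec.size:
            # Only the first match is reported, as in the loop above.
            yield line, vec[0]


matches = list(iter_matches(fin, df))
for line, vec in matches:
    print(line, vec)
```

Each line triggers a full scan of the position column, which is fine for small inputs but may get slow when both the file and the dataframe are large.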
Approach #2 (merging, the shorter one)
Load the lines of the text file into a second dataframe and merge it with the initial one on the matching values.
# dtype=str keeps the leading zeros of each line
df2 = pd.read_table('yourfile.txt', header=None, dtype=str)
matched_df = df.merge(df2, left_on='position', right_on=0)
print(matched_df.to_string(columns=['position', 'vector'], header=False, index=False))
The output (for the initial input):
0000000000000000000000000001000000000000000000000000000000000 01
0000000000010000000000000000000000000000000000000000000000000 01
0000000000000000000000000000000000000010000000000000000000000 01
0000000000000000000000000000000000000000000000000000000000100 10
0000000000000000000000000000000010000000000000000000000000000 10
0000000000100000000000000000000000000000000000000000000000000 10
0000000000001000000000000000000000000000000000000000000000000 01
0100000000000000000000000000000000000000000000000000000000000 01
0000000000000100000000000000000000000000000000000000000000000 10
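On the short sample from the question, the merge approach can be sketched as follows. This is only an illustration: io.StringIO stands in for the text file, and naming the file column position via names= is my own choice so that the merge can use on= instead of left_on/right_on.

```python
import io

import pandas as pd

# Sample dataframe copied from the question.
df = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Stand-in for the text file (assumption: one string per line).
file_text = "0100000000\n0001000000\n"

# dtype=str keeps the leading zeros; names= labels the single column.
df2 = pd.read_table(io.StringIO(file_text), header=None,
                    names=['position'], dtype=str)

# The default inner merge keeps only positions present in both frames.
matched_df = df.merge(df2, on='position')
print(matched_df.to_string(header=False, index=False))
```

An inner merge follows the key order of the left dataframe, so the rows come out in dataframe order, not file order. If a line is repeated in the file, its row is repeated in the result.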
CodePudding user response:
You could read the file into a set, then select the rows whose position is an element of it. The only caveat is that this doesn't preserve the order of the lines in the file: the rows come out in dataframe order.
with open(...) as f1:
    pos = set(line.rstrip('\n') for line in f1)
df_out = data_f.loc[data_f['position'].isin(pos)]
df_out
     position vector
1  0100000000     01
3  0001000000     01
Then to print it like you want:
print(df_out.to_string(header=False, index=False))
0100000000 01
0001000000 01
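If the order of the lines in the file does matter, one possible workaround is a plain dictionary lookup instead of isin. This is a sketch, not part of the approach above: it assumes every position occurs only once in the dataframe, io.StringIO stands in for the file, and the names lookup and ordered are made up for the example.

```python
import io

import pandas as pd

# Sample dataframe copied from the question.
data_f = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Stand-in for the text file; the lines are deliberately
# in a different order than in the dataframe.
f1 = io.StringIO("0001000000\n0100000000\n")

# Map each position to its vector (assumes positions are unique;
# with duplicates, the last occurrence would win).
lookup = dict(zip(data_f['position'], data_f['vector']))

ordered = []
for line in f1:
    key = line.rstrip('\n')
    if key in lookup:
        ordered.append((key, lookup[key]))

for key, vec in ordered:
    print(key, vec)
```

The dictionary is built once, so each line costs a single hash lookup, and the matches are collected in the order the lines appear in the file.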