My code is intended to read multiple files (two examples below), match the digits on each line of every file, and then combine all matches, together with the filename each was found in, into a dataframe. My first issue is that findall returns its matches one line at a time, and I'm not sure how to append them properly. The findall outputs look like:
65
45
78
etc
Two file examples are below:
F1:
trust 65
musca
linca 75
trig
torst 50
F2:
munk 65
liki 34
grub
I want my code to generate the following final dataframe:
Filename score
F1 65
F1 75
F1 50
F2 65
F2 34
My code attempt:
import os
import re
import pandas as pd
final={}
for f in *.txt:
    with open(f,"r") as In1:
        (filename,ext)=os.path.splitext(f)
        for line in In1:
            m=re.findall(r'\d+',line)
            if len(match) > 0:
                all=[]
                all.append(m)
                final[filename]=all
df=pd.DataFrame(final.items(),columns=['Filename','Score']
Can someone point me in the right direction please?
CodePudding user response:
You can try
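import pandas as pd

# The files are space-separated, so pass sep=' '; lines without a number get NaN in column 1
df1 = pd.read_csv('file1', header=None, sep=' ')
df2 = pd.read_csv('file2', header=None, sep=' ')
df = (pd.concat([df1.assign(Filename='F1'),
                 df2.assign(Filename='F2')],
                ignore_index=True)
      .dropna(subset=[1])             # keep only the lines that had a number
      .rename(columns={1: 'score'})
      .drop(columns=0))               # drop the word column
print(df)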
   score Filename
0   65.0       F1
2   75.0       F1
4   50.0       F1
5   65.0       F2
6   34.0       F2
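The same idea can be extended to any number of files by looping over them instead of naming each one. The following is only a sketch under some assumptions: the helper name read_scores, the column name "word", and the temporary F1.txt/F2.txt demo files are made up for illustration, and every line is assumed to hold at most one word followed by one number.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_scores(paths):
    """Read each space-separated file and keep only the lines that carry a number."""
    frames = []
    for path in paths:
        # names=... fixes the column count, so lines without a number get NaN in "score"
        frame = pd.read_csv(path, sep=" ", header=None, names=["word", "score"])
        frames.append(frame.assign(Filename=Path(path).stem))
    combined = pd.concat(frames, ignore_index=True)
    return combined.dropna(subset=["score"])[["Filename", "score"]].reset_index(drop=True)


# Demo data (hypothetical file names): recreate the two example files in a temp folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "F1.txt").write_text("trust 65\nmusca\nlinca 75\ntrig\ntorst 50\n")
    Path(tmp, "F2.txt").write_text("munk 65\nliki 34\ngrub\n")
    df = read_scores(sorted(Path(tmp).glob("*.txt")))

print(df)
```

Because the missing values are NaN, the score column comes back as floats; adding .astype(int) after the dropna would turn them into integers if that matters.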
CodePudding user response:
Here's a way to do what your question asks:
import pandas as pd
from io import StringIO
fileStrings = {
'F1': '''
trust 65
musca
linca 75
trig
torst 50
''',
'F2': '''
munk 65
liki 34
grub
'''
}
res = pd.concat([
    pd.DataFrame({
        'Filename': k,
        'score': pd.read_csv(StringIO(v), header=None, sep=' ').iloc[:,1].dropna()})
    for k, v in fileStrings.items()]).reset_index(drop=True)
print(res)
Output:
  Filename  score
0       F1   65.0
1       F1   75.0
2       F1   50.0
3       F2   65.0
4       F2   34.0
The above example uses the strings read from the two files detailed in the question. It will also work for any number of files if the variable fileStrings is changed to map each file name to that file's string contents.
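If you would rather keep the regex approach from the question, one possible fix is to append one row per match to a single list and build the dataframe once at the end, rather than resetting the list on every line. This is a rough sketch, not the only way to do it: the function name collect_scores, the "*.txt" pattern, and the temporary demo files are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def collect_scores(folder, pattern="*.txt"):
    """Return one row per number found, tagged with the file's base name."""
    rows = []  # created once, outside the loops, so earlier matches are not overwritten
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, "r") as handle:
            for line in handle:
                # findall returns a (possibly empty) list of digit runs on this line
                for number in re.findall(r"\d+", line):
                    rows.append({"Filename": path.stem, "score": int(number)})
    return pd.DataFrame(rows, columns=["Filename", "score"])


# Demo data (hypothetical file names): recreate the two example files in a temp folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "F1.txt").write_text("trust 65\nmusca\nlinca 75\ntrig\ntorst 50\n")
    Path(tmp, "F2.txt").write_text("munk 65\nliki 34\ngrub\n")
    df = collect_scores(tmp)

print(df)
```

Storing rows as a list of dicts avoids the one-key-per-file problem of the dictionary in the question, where each new match replaced the previous value for that filename.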