I asked a question about reading XML data into a pandas DataFrame and got the following answer:
medlinecitation = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//medlinecitation").dropna(axis=1)
abstract = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//abstract")
dfa = pd.merge(
    left=medlinecitation,
    right=abstract,
    how="outer",
    left_index=True,
    right_index=True,
).fillna(method="ffill")
This produces the following output:
   owner   status      pmid  citationsubset     otherid  abstracttext
0    NLM  MEDLINE  17846141              IM  PMC2151464  Empirical use of beta-lactam antibiotics, the ...
I have the following files in my folder. I want to read each file as above and build a single DataFrame in which each row represents one file. I can list all the files with the loop below, but I don't know how to build a DataFrame from all of them.
for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Taxonomy_NLP/public_dat/trainset/17846141.xml
Taxonomy_NLP/public_dat/trainset/10649814.xml
Taxonomy_NLP/public_dat/trainset/20091541.xml
Taxonomy_NLP/public_dat/trainset/11493721.xml
Taxonomy_NLP/public_dat/trainset/11505031.xml
Taxonomy_NLP/public_dat/trainset/14557142.xml
Taxonomy_NLP/public_dat/trainset/15174889.xml
Taxonomy_NLP/public_dat/trainset/1565551.xml
Taxonomy_NLP/public_dat/trainset/15159270.xml
Taxonomy_NLP/public_dat/trainset/12837416.xml
Taxonomy_NLP/public_dat/trainset/10629474.xml
CodePudding user response:
I'm not sure how the files in the directory relate to the two paths you mention above.
Here is a loop over the data you have:
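import os
import pandas as pd

def xml_to_df(path1, path2):
    # Citation fields; drop columns that came back entirely empty.
    medlinecitation = pd.read_xml(path1, xpath=".//medlinecitation").dropna(axis=1)
    abstract = pd.read_xml(path2, xpath=".//abstract")
    dfa = pd.merge(
        left=medlinecitation,
        right=abstract,
        how="outer",
        left_index=True,
        right_index=True,
    ).ffill()  # same effect as fillna(method="ffill")
    return dfa

list_of_dfs = list()
# `path` is the root folder from the os.walk loop in the question.
for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        path1 = os.path.join(dirname, filename)
        # Assumption: both elements come from the same file, as in the question's snippet.
        path2 = path1
        list_of_dfs.append(xml_to_df(path1, path2))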
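One thing to watch with the os.walk loop: it yields every file under the folder, not only XML files, so a stray file of another type would most likely make pd.read_xml raise a parse error. A minimal sketch of a filter, assuming the files of interest all end in .xml (iter_xml_paths is a hypothetical helper name, not something from the question):

```python
import os


def iter_xml_paths(root):
    """Yield the paths of files under `root` whose names end in .xml (case-insensitive)."""
    for dirname, _, filenames in os.walk(root):
        # Sort the names so files within one directory come out in a stable order.
        for filename in sorted(filenames):
            if filename.lower().endswith(".xml"):
                yield os.path.join(dirname, filename)
```

The loop above could then iterate over iter_xml_paths(path) instead of walking the tree directly.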
CodePudding user response:
Your code seems to be loading two different elements from the same XML file. You can create a function to do this which returns the new dataframe:
def read_gct(path):
    # Parentheses let the method chain continue on the next line.
    medlinecitation = (
        pd.read_xml(path, xpath=".//medlinecitation")
        .dropna(axis=1)
    )
    abstract = pd.read_xml(path, xpath=".//abstract")
    dfa = pd.merge(
        left=medlinecitation,
        right=abstract,
        how="outer",
        left_index=True,
        right_index=True,
    ).ffill()  # same effect as fillna(method="ffill")
    return dfa
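You can use pathlib.Path.glob to find all XML files under a directory and load a dataframe for each one. Finally, you can concatenate all the dataframes into a single one with pd.concat (ignore_index=True gives each file's row its own index value instead of repeating 0):
from pathlib import Path

fileDfs = (read_gct(path) for path in Path('root').glob('**/*.xml'))
final_df = pd.concat(fileDfs, ignore_index=True)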
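Putting the two steps together (a per-file reader, then pd.concat), here is a self-contained sketch. The names read_citation and load_folder, the parser argument, and the extra source_file column are my own additions for illustration, not part of the question. It also assumes every file contains both a medlinecitation and an abstract element, as the sample file does.

```python
from pathlib import Path

import pandas as pd


def read_citation(path, parser="lxml"):
    """Return the merged citation/abstract frame for one XML file."""
    citation = pd.read_xml(path, xpath=".//medlinecitation", parser=parser).dropna(axis=1)
    abstract = pd.read_xml(path, xpath=".//abstract", parser=parser)
    merged = pd.merge(citation, abstract, how="outer", left_index=True, right_index=True)
    return merged.ffill()


def load_folder(root, parser="lxml"):
    """Concatenate the frames of every *.xml file found under `root`."""
    paths = sorted(Path(root).rglob("*.xml"))
    # source_file is a hypothetical extra column recording which file each row came from.
    frames = [read_citation(str(p), parser=parser).assign(source_file=p.name) for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index.
    return pd.concat(frames, ignore_index=True)
```

With the folder from the question this would be called as final_df = load_folder("Taxonomy_NLP/public_dat/trainset"). Note that pd.concat raises a ValueError if no XML files are found, and parser="etree" can be passed if lxml is not installed.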