Reading all the XML files to make a dataframe

I asked a question about reading XML data into a pandas dataframe.

I got the following answer:

medlinecitation = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//medlinecitation").dropna(axis=1)
abstract = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//abstract")

dfa = pd.merge(
    left=medlinecitation,
    right=abstract,
    how="outer",
    left_index=True,
    right_index=True,
).fillna(method="ffill")

It produces the following output:

    owner   status  pmid    citationsubset  otherid     abstracttext
0   NLM     MEDLINE     17846141    IM  PMC2151464  Empirical use of beta-lactam antibiotics, the ...

I have the following files in my folder and want to read each one as above, building a single dataframe in which each row represents one file. I can list all the files with the following loop, but I don't know how to build the dataframe from all of them.

for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Taxonomy_NLP/public_dat/trainset/17846141.xml
Taxonomy_NLP/public_dat/trainset/10649814.xml
Taxonomy_NLP/public_dat/trainset/20091541.xml
Taxonomy_NLP/public_dat/trainset/11493721.xml
Taxonomy_NLP/public_dat/trainset/11505031.xml
Taxonomy_NLP/public_dat/trainset/14557142.xml
Taxonomy_NLP/public_dat/trainset/15174889.xml
Taxonomy_NLP/public_dat/trainset/1565551.xml
Taxonomy_NLP/public_dat/trainset/15159270.xml
Taxonomy_NLP/public_dat/trainset/12837416.xml
Taxonomy_NLP/public_dat/trainset/10629474.xml

CodePudding user response：

I'm not sure how the files in the directory relate to the two paths you mention above.

Here is one way to iterate over the files you have:

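import os

import pandas as pd


def xml_to_df(path1, path2):
    medlinecitation = pd.read_xml(path1, xpath=".//medlinecitation").dropna(axis=1)
    abstract = pd.read_xml(path2, xpath=".//abstract")

    dfa = pd.merge(
        left=medlinecitation,
        right=abstract,
        how="outer",
        left_index=True,
        right_index=True,
    ).ffill()  # forward-fill the gaps left by the outer merge
    return dfa


list_of_dfs = list()

# path is the root folder from the question
for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        path1 = os.path.join(dirname, filename)
        # Assumption: as in the question, both elements live in the same
        # file, so the same path is passed for both arguments.
        list_of_dfs.append(xml_to_df(path1, path1))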

CodePudding user response：

Your code seems to be loading two different elements from the same XML file. You can create a function to do this which returns the new dataframe:

def read_gct(path):
  # Parentheses let the method chain continue on the next line
  medlinecitation = (
      pd.read_xml(path, xpath=".//medlinecitation")
      .dropna(axis=1)
  )
  abstract = pd.read_xml(path, xpath=".//abstract")

  dfa = pd.merge(
      left=medlinecitation,
      right=abstract,
      how="outer",
      left_index=True,
      right_index=True,
  ).ffill()  # forward-fill the gaps left by the outer merge
  return dfa

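You can use pathlib.Path.glob to find all XML files under a path and load a dataframe for each one. Finally, you can concatenate all the dataframes into a single one with pd.concat:

import pathlib

# 'root' stands for your top-level folder; '**' also searches subfolders
fileDfs = (read_gct(path) for path in pathlib.Path('root').glob('**/*.xml'))
# ignore_index=True renumbers the rows 0..n-1, one per file
final_df = pd.concat(fileDfs, ignore_index=True)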