Reading all the XML files to make dataframe

Time:05-23

I previously asked a question about reading XML data into a pandas DataFrame:

NLP using XLM dataset

I got the following answer:

medlinecitation = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//medlinecitation").dropna(axis=1)
abstract = pd.read_xml("Taxonomy_NLP/public_dat/trainset/17846141.xml", xpath=".//abstract")

dfa = pd.merge(
    left=medlinecitation,
    right=abstract,
    how="outer",
    left_index=True,
    right_index=True,
).fillna(method="ffill")

It produces the following output:

  owner   status      pmid citationsubset     otherid                                       abstracttext
0   NLM  MEDLINE  17846141             IM  PMC2151464  Empirical use of beta-lactam antibiotics, the ...

I have the following files in my folder. I want to read each file as above and build a single data frame in which each row represents one file. I can list all the files with the following loop, but I don't know how to build one data frame from all of them.

for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Taxonomy_NLP/public_dat/trainset/17846141.xml
Taxonomy_NLP/public_dat/trainset/10649814.xml
Taxonomy_NLP/public_dat/trainset/20091541.xml
Taxonomy_NLP/public_dat/trainset/11493721.xml
Taxonomy_NLP/public_dat/trainset/11505031.xml
Taxonomy_NLP/public_dat/trainset/14557142.xml
Taxonomy_NLP/public_dat/trainset/15174889.xml
Taxonomy_NLP/public_dat/trainset/1565551.xml
Taxonomy_NLP/public_dat/trainset/15159270.xml
Taxonomy_NLP/public_dat/trainset/12837416.xml
Taxonomy_NLP/public_dat/trainset/10629474.xml

CodePudding user response:

I'm not sure how the files in the directory relate to the two paths you mention above.

Here is a loop over the files you have:

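import os

import pandas as pd


def xml_to_df(path1, path2):
    # Read the citation metadata and the abstract as separate frames.
    medlinecitation = pd.read_xml(path1, xpath=".//medlinecitation").dropna(axis=1)
    abstract = pd.read_xml(path2, xpath=".//abstract")

    # Align the two frames on their index and forward-fill the gaps.
    # .ffill() does the same as fillna(method="ffill"), which newer pandas deprecates.
    dfa = pd.merge(
        left=medlinecitation,
        right=abstract,
        how="outer",
        left_index=True,
        right_index=True,
    ).ffill()
    return dfa


list_of_dfs = list()

for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        path1 = os.path.join(dirname, filename)
        # Assumption: both elements come from the same file, as in the question.
        path2 = path1
        list_of_dfs.append(xml_to_df(path1, path2))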

CodePudding user response:

Your code seems to be loading two different elements from the same XML file. You can create a function to do this which returns the new dataframe:

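def read_gct(path):
  # Both elements are read from the same XML file.
  # The parentheses let the method chain continue on the next line.
  medlinecitation = (
      pd.read_xml(path, xpath=".//medlinecitation")
      .dropna(axis=1)
  )
  abstract = pd.read_xml(path, xpath=".//abstract")

  # .ffill() does the same as fillna(method="ffill"), which newer pandas deprecates.
  dfa = pd.merge(
      left=medlinecitation,
      right=abstract,
      how="outer",
      left_index=True,
      right_index=True,
  ).ffill()
  return dfa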

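You can use pathlib.Path.glob to find all XML files under a folder and load a dataframe for each one. Finally, you can concatenate all dataframes into a single one with pd.concat:

import pathlib

# 'root' is a placeholder for the top-level folder; '**' makes the search recursive.
fileDfs = (read_gct(path) for path in pathlib.Path('root').glob('**/*.xml'))
# ignore_index=True renumbers the rows so every file gets its own index label.
final_df = pd.concat(fileDfs, ignore_index=True)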
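
If you would rather not depend on pd.read_xml, the same "one row per file" idea can be sketched with the standard library alone. This is a different technique from the merge above, not a drop-in copy of it. The helper names xml_to_record and folder_to_records are hypothetical, the two tiny XML files are made up for the demo, and the tag names (medlinecitation, abstracttext) are assumptions taken from the xpath expressions and column names in the question, so adjust them to your real files.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_record(path):
    """Flatten one XML file into a dict (hypothetical helper)."""
    root = ET.parse(str(path)).getroot()
    record = {"file": Path(path).name}

    # Assumed tag name, taken from the xpath used in the question.
    citation = next(root.iter("medlinecitation"), None)
    if citation is not None:
        # Attributes become columns (assumed: owner, status).
        record.update(citation.attrib)
        for child in citation:
            # Keep only leaf children that carry text (assumed: pmid and similar).
            if len(child) == 0 and child.text and child.text.strip():
                record[child.tag] = child.text.strip()

    # Join all abstract paragraphs into one string per file.
    parts = [el.text.strip() for el in root.iter("abstracttext") if el.text]
    record["abstracttext"] = " ".join(parts)
    return record


def folder_to_records(folder):
    """Collect one record per *.xml file found anywhere under folder."""
    return [xml_to_record(p) for p in sorted(Path(folder).rglob("*.xml"))]


# Demo on two made-up files; the content is illustrative, not real data.
TEMPLATE = (
    '<pubmedarticle><medlinecitation owner="NLM" status="MEDLINE">'
    "<pmid>{pmid}</pmid>"
    "<article><abstract><abstracttext>{text}</abstracttext></abstract></article>"
    "</medlinecitation></pubmedarticle>"
)

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.xml").write_text(TEMPLATE.format(pmid="1", text="First made-up abstract."))
    (Path(tmp) / "sub" / "b.xml").write_text(TEMPLATE.format(pmid="2", text="Second made-up abstract."))
    records = folder_to_records(tmp)

for rec in records:
    print(rec)
```

Passing the list to pd.DataFrame(records) should then give a frame with one row per file; keys that are missing in some files simply come out as NaN in that row.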