Home > Net >  How to read a text data into pandas dataframe which is partially in key value format
How to read a text data into pandas dataframe which is partially in key value format

Time:09-21

File content of file test:

costCenter: LL63238012
mail: [email protected]
LLpResponsible: cn=LLf58420,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=economy,NisMap=llc1002:/proj/llc1002_ziz1/q,Quota=10621
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=scratKG,NisMap=llc1002:/proj/llc1002_ziz1_scratKG/q,Quota=12000,Id=scratKG
fullName: Tulip project ziz1

costCenter: MX61FRK604
mail: [email protected]
LLpResponsible: cn=LLa11826,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home_3/q,Quota=100,Id=3
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home/q,Quota=300
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home_2/q,Quota=100,Id=2
fullName: xFSL to LLDI migration



costCenter: RU61FPD561
mail: [email protected]
LLpResponsible: cn=LLa09278,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=1,Quality=EconomyHP,NisMap=llc2002:/proj/llc2002_zru12/q
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=2800,Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_analog/q,Id=analog
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=1100,Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_home/q,Id=home
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_libddk/q,Quota=2162,Id=libddk
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_proj/q,Quota=1102,Id=proj
fullName: zru12

costCenter: KG63010285
mail: [email protected]
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_EMEA_NL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=BLLinessCriticalHP,Quota=60,NisMap=llc4008:/proj/llc4008_zuriKG/q
fullName: Container to store ZuriKG vault



costCenter: KG63010285
mail: [email protected]
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_EMEA_NL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,Quota=1,NisMap=llc3008:/proj/llc3008_zuriKG_rme/q
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_EMEA_NL-RME01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,Quota=30,NisMap=llc4014:/proj/llc4014_zuriKG_rme/q
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_EMEA_NL-RME01,ou=Locations,ou=LLDI,o=LLP#0#Quality=ScratKGHP,Quota=400,NisMap=llc4014:/proj/llc4014_zuriKG_rme_scratKG/q,Id=scratKG
fullName: Project to restore  project data on the RME work on HP-UX

Code which i am using:

def generate(data):
    for record in data.split("\n\n"):              # Split records based on two newlines (unix)
        result = {}
        for line in record.split("\n"):            # Split properties based on single newlines (unix)
            if line:                               # Skip empty lines happening for extra or trailing newlines
                key, *value = line.split(": ")     # Tolerant to lines with more than a single ´: ´ (*values)
                value = ": ".join(value)           # Recover original value if more than a single (`: `)
                if key in result:
                    result[key]  = ";"   value
                else:
                    result[key] = value
        if result:                                 # Don't yield empty results
            yield result


frame = pd.DataFrame(generate("test"))
print(frame)

#frame["responsible"] = frame["LLpResponsible"].str.extract("cn=([\w]*)")
frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([\w_\-]*)")
frame["directory"] = frame["LLpHomeDirectory"].str.findall("NisMap=\w :([\w_\-/]*)")

df1 = frame[['costCenter',  'mail', 'responsible', 'location', 'directory']]
#df2 = df.explode("directory")[["costCenter", "responsible", "directory", "mail", "location"]]
print(df1)

Error:

When i am running this, i am getting below ....

 $ python ldapDataParse1
  test
0
Traceback (most recent call last):
  File "/home/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'LLpHomeDirectory'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ldapDataParse1", line 30, in <module>
    frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([\w_\-]*)")
  File "/home/lib64/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'LLpHomeDirectory'

CodePudding user response:

You just need to copy paste those two functions together, the original autor even gave you the comments what the code does. You need the context manager (with open(...)...) to open the file and then after for record in handler.read().split("\n\n") you just use the 2nd version of the answer to take care of extra newlines.

def generate(file):
    with open(file) as handler:
        for record in handler.read().split("\n\n"):
            result = {}
            for line in record.split("\n"):
                if line:
                    key, *value = line.split(": ")
                    value = ": ".join(value)
                    if key in result:
                        result[key]  = ";"   value
                    else:
                        result[key] = value
            if result:
                yield result
frame = pd.DataFrame(generate("test.txt"))

frame["responsible"] = frame["LLpResponsible"].str.extract("cn=([\w]*)")
frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([\w_\-]*)")
frame["directory"] = frame["LLpHomeDirectory"].str.findall("NisMap=\w :([\w_\-/]*)")
print(frame[["costCenter", "responsible", "directory"]])

Output:

   costCenter responsible                                          directory
0  MX61FRK604    LLa11826  [/proj/llc0156_zmx28home_3/q, /proj/llc0156_zm...
1  RU61FPD561    LLa09278  [/proj/llc2002_zru12/q, /proj/llc1002_zru12_an...
2  KG63010285         NaN                           [/proj/llc4008_zuriKG/q]
3  KG63010285         NaN  [/proj/llc3008_zuriKG_rme/q, /proj/llc4014_zur...
  • Related