XML Data flatten, how to get the original data back-CodePudding

I have a txt file with the following example data:

id,001
v1,some_value
id,002
v1,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value

The original data has a xml structure. It has been flatten. So the order of the values is important.

In practice, the number of variables is huge (more than 1000). And there is nested structure: v1 can be the parent category for v2. We have this information in another file.

My question is: given this flatten structure, how can we get the original xml data back, with R or python?

In fact, at the end, I would like to convert these values into structured tabular data, with the columns: id, v1, v2, v3, etc. So if it is simpler to first get this format, it is also possible. Then we don't need the nested structure variable to do the conversion, but only a list of possibles variables in order. Let's say that for the example above, we have the variables: v1, v2, v3, and v4. So the final data would be:

[Update]: the order of variables is important to get the original data back.

For example, if we have:

id,001
v1,some_value
id,002
v2,some_value
v1,some_value
id,003
v2,some_value
id,004
v4,some_value

Then, we would get:

CodePudding user response：

You mention you have your existing fields in another file. Provided you can use these to make a named blank character vector like this:

existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")

Then you can do:

df <- read.csv(text = "my_textfile.csv", header = FALSE)

id_list <- split(df, cumsum(df$V1 == "id"))

do.call(rbind, lapply(id_list, function(x) {
  vec <- setNames(x$V2, x$V1)
  existing_list[match(names(vec), names(existing_list))] <- vec
  as.data.frame(as.list(existing_list))
  }))

#>    id         v1         v2 v3         v4
#> 1 001 some_value                         
#> 2 002 some_value some_value              
#> 3 003            some_value              
#> 4 004                          some_value

Again, no external packages are used here.

CodePudding user response：

Here's a way to do it Python.

Assuming the input file 'flattened_xml.txt'contains:

id,001
v1,some_value1
id,002
v2,some_value2
v1,some_value3
id,003
v2,some_value4
id,004
v4,some_value5

Code:

# Get lines of text file.
with open('flattened_xml.txt') as file:
    data = file.read().splitlines()

# Determine what fiels are present.
fields = []
for line in data:
    id, _ = line.split(',')
    if id not in fields:
        fields.append(id)

table = [{field:field for field in fields}]  # Header row.
row = {}
for line in data:
    field, value = line.split(',')

    if field == fields[0]:
        row = {field: value}
        row.update({key: None for key in fields[1:]})
        continue

    row.update({key: value if key == field else None for key in fields[1:]})
    table.append(row)
    row = {key: None for key in fields}

for row in table:
    s = [f'{value if value is not None else "":11}' for value in row.values()]
    print('|'   '|'.join(s)   '|')

Result:

|id         |v1         |v2         |v4         |
|001        |some_value1|           |           |
|002        |           |some_value2|           |
|           |some_value3|           |           |
|003        |           |some_value4|           |
|004        |           |           |some_value5|