I have a txt file with the following example data:
id,001
v1,some_value
id,002
v1,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value
The original data has a xml structure. It has been flatten. So the order of the values is important.
In practice, the number of variables is huge (more than 1000). And there is nested structure: v1 can be the parent category for v2. We have this information in another file.
My question is: given this flatten structure, how can we get the original xml data back, with R or python?
In fact, at the end, I would like to convert these values into structured tabular data, with the columns: id, v1, v2, v3, etc. So if it is simpler to first get this format, it is also possible. Then we don't need the nested structure variable to do the conversion, but only a list of possibles variables in order. Let's say that for the example above, we have the variables: v1, v2, v3, and v4. So the final data would be:
[Update]: the order of variables is important to get the original data back.
For example, if we have:
id,001
v1,some_value
id,002
v2,some_value
v1,some_value
id,003
v2,some_value
id,004
v4,some_value
Then, we would get:
CodePudding user response:
You mention you have your existing fields in another file. Provided you can use these to make a named blank character vector like this:
existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")
Then you can do:
df <- read.csv(text = "my_textfile.csv", header = FALSE)
id_list <- split(df, cumsum(df$V1 == "id"))
do.call(rbind, lapply(id_list, function(x) {
vec <- setNames(x$V2, x$V1)
existing_list[match(names(vec), names(existing_list))] <- vec
as.data.frame(as.list(existing_list))
}))
#> id v1 v2 v3 v4
#> 1 001 some_value
#> 2 002 some_value some_value
#> 3 003 some_value
#> 4 004 some_value
Again, no external packages are used here.
CodePudding user response:
Here's a way to do it Python.
Assuming the input file 'flattened_xml.txt'
contains:
id,001
v1,some_value1
id,002
v2,some_value2
v1,some_value3
id,003
v2,some_value4
id,004
v4,some_value5
Code:
# Get lines of text file.
with open('flattened_xml.txt') as file:
data = file.read().splitlines()
# Determine what fiels are present.
fields = []
for line in data:
id, _ = line.split(',')
if id not in fields:
fields.append(id)
table = [{field:field for field in fields}] # Header row.
row = {}
for line in data:
field, value = line.split(',')
if field == fields[0]:
row = {field: value}
row.update({key: None for key in fields[1:]})
continue
row.update({key: value if key == field else None for key in fields[1:]})
table.append(row)
row = {key: None for key in fields}
for row in table:
s = [f'{value if value is not None else "":11}' for value in row.values()]
print('|' '|'.join(s) '|')
Result:
|id |v1 |v2 |v4 |
|001 |some_value1| | |
|002 | |some_value2| |
| |some_value3| | |
|003 | |some_value4| |
|004 | | |some_value5|