Parsing identical foreign dictionary JSON file entries, but having different meanings

Time:03-15

I am reading bilingual English <-> Japanese dictionary data from a JSON file in my iOS app bundle. The file contains identical keys with different meanings (e.g. abandon), as shown in 'json data set 1' below:

{
"aardvark":"土豚 (つちぶた)",
"abacus":"算盤 (そろばん)",
"abacus":"十露盤 (そろばん)",
"abalone":"鮑 (あわび)",
"abandon":"乗り捨てる (のりすてる)(a ship or vehicle)",
"abandon":"取り下げる (とりさげる)(e.g. a lawsuit)",
"abandon":"捨て去る (すてさる)(ship)",
"abandon":"泣し沈む (なきしずむ)oneself to grief",
"abandon":"遺棄する (いき)",
"abandon":"握りつぶす (にぎりつぶす)",
"abandon":"握り潰す (にぎりつぶす)",
"abandon":"見限る (みかぎる)",
"abandon":"見切り (みきり)",
"abandon":"見捨てる (みすてる)",
"abandon":"突き放す[見捨てる (つきはなす)",
"abandon":"放り出す (ほうりだす)",
"abandon":"廃棄 (はいき)",
"abandon":"廃棄する (はいき)",
"abandon":"放棄する (ほうき)",
}

I am using the code snippet below to read in the data from the app.bundle directory:

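import Foundation

var vocab: [String: String] = [:]

do {
    // Locate the dictionary file in the app bundle and read it as UTF-8 text.
    let path = Bundle.main.path(forResource: "words_alpha", ofType: "json")!
    let text = try String(contentsOfFile: path, encoding: .utf8)

    // Decode into a flat [String: String]; duplicate keys collapse into a single entry here.
    vocab = try JSONDecoder().decode([String: String].self, from: Data(text.utf8))
    print(text)
} catch {
    print(error)
}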

Question: Only the first of each set of duplicate entries is read in, whereas I would like all duplicates to be read in as multiple definitions of a single dictionary term.

One solution is to reformat the duplicate entries in the JSON data by joining the different definitions with line breaks, as in 'json data set 2':

"abandon":"乗り捨てる (のりすてる)(a ship or vehicle)\n\n取り下げる (とりさげる)(e.g. a lawsuit)\n\n捨て去る (すてさる)(ship)\n\n 泣し沈む (なきしずむ)oneself to grief\n\n",

However, that is a huge amount of work editing a 30MB json data file to make the above changes for duplicate items so I am looking for a quick and dirty way to use swift json string manipulations to read in the data 'as is' using the native 'data set 1' format with each entry being on a line by itself as shown below:

{
"abandon":"乗り捨てる (のりすてる)(a ship or vehicle)",
"abandon":"取り下げる (とりさげる)(e.g. a lawsuit)",
}

I have tried a number of approaches, but none have worked so far. Any suggestions would be very much appreciated.

CodePudding user response:

A Dictionary can't contain the same key twice: keys must be unique.

If you run the file through a validator such as JSONLint, you'll get "Error: Duplicate key 'abacus'" (it stops at the first error found).

You wrote: "However, that is a huge amount of work editing a 30MB json data file to make the above changes for duplicate items so I am looking for a quick and dirty way"

Instead of thinking that way, let's pre-process the JSON, and fix it!

So you could write a little script beforehand to fix your JSON. You can do it in Swift!

In a quick & dirty way, you could do this:

Prerequisite: each entry must be on a single line (otherwise you may have to fix those entries manually).

Create a fixJSON.swift file (in Terminal.app: $> touch fixJSON.swift), make it executable ($> chmod +x fixJSON.swift), and put the following code inside it:

#!/usr/bin/swift

import Foundation 

func fixingJSON(_ input: String) -> String? {
    let lines = input.components(separatedBy: .newlines)
    let regex = try! NSRegularExpression(pattern: "\"(\\w+)\":\"(.*?)\",", options: [])
    let output = lines.reduce(into: [String: [String]]()) { partialResult, aLine in
        var cleaned = aLine.trimmingCharacters(in: .whitespaces)
        guard !cleaned.isEmpty else { return }
        if !cleaned.hasSuffix(",") { // Artificially add a ",", for the last-line case
            cleaned.append(",")
        }
        guard let match = regex.firstMatch(in: cleaned, options: [], range: NSRange(location: 0, length: cleaned.utf16.count)) else { return }
        guard let wordRange = Range(match.range(at: 1), in: cleaned),
              let definitionRange = Range(match.range(at: 2), in: cleaned) else { return }
        partialResult[String(cleaned[wordRange]), default: [String]()] += [String(cleaned[definitionRange])]
    }

//        print(output)

    do {
        let encoder = JSONEncoder()
        encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
        let asJSONData = try encoder.encode(output)
        let asJSONString = String(data: asJSONData, encoding: .utf8)
        // print(asJSONString!)
        return asJSONString
    } catch {
        print("Error while encoding: \(error)")
        return nil
    }
}


func main() {
    do {
        //To change
        let path = "translate.json"
        let content = try String(contentsOfFile: path, encoding: .utf8)
        guard let output = fixingJSON(content) else { return }
        //To change
        let outputPath = "translate2.json"
        try output.write(to: URL(fileURLWithPath: outputPath), atomically: true, encoding: .utf8)
    } catch {
        print("Oops, error while trying to read or write content of file:\n\(error)")
    }
}

main()

Modify the path and outputPath values. It's easiest to put the JSON file in the same directory as the script; then the path is just the file name.

Then, in Terminal.app, just write $> ./fixJSON.swift

Okay, now, let's talk about what the script does. As said, it's quick & dirty, and might have issues.

The script reads the content of the faulty JSON, iterates over its lines, and uses a regex to find entries of this form:

"englishWord":"anything",

I artificially add a comma if there isn't one (a special case for the last entry of the JSON, which shouldn't have one). The reason is that a translation could contain double quotes, so the regex looks for the closing quote followed by a comma; without the comma, the match could end at the wrong quote. It's just a quick & dirty fix. I could do better, but since this is for one-time use, spending more time writing beautiful code might be overkill.

In the end, you'll have a JSON file of type [String: [String]], where each word maps to an array of its definitions.
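On the app side, the fixed file can then be decoded with JSONDecoder as before, only with [String: [String]] as the target type. The following is a minimal sketch under a few assumptions: the function name loadVocabulary and the inline sample are made up for illustration, and the "\n\n" separator is borrowed from the question's 'data set 2'.

```swift
import Foundation

// Hypothetical helper: decodes the fixed JSON, where each word maps to an array of definitions.
func loadVocabulary(from data: Data) throws -> [String: [String]] {
    try JSONDecoder().decode([String: [String]].self, from: data)
}

// Stand-in for the contents of the fixed file (illustrative sample only).
let sample = """
{
  "abacus" : ["算盤 (そろばん)", "十露盤 (そろばん)"],
  "abalone" : ["鮑 (あわび)"]
}
"""

let vocabulary = try! loadVocabulary(from: Data(sample.utf8))

// If the UI expects a single string per word, join the definitions with blank lines.
let display = vocabulary.mapValues { $0.joined(separator: "\n\n") }
```

In the real app, the Data would come from the bundled file rather than from an inline string.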

CodePudding user response:

This is fine JSON (except for the last comma, which isn't legal), but Swift's JSONDecoder can't handle it. The JSON grammar permits duplicate keys (the RFC only says names SHOULD be unique), but a Swift Dictionary doesn't. So you'll need to parse it by hand.

If your data is exactly as given, one record per line, with nothing "weird" (like embedded \" in the Strings), the easiest way to do that is to just parse it line by line, using simple String manipulation or NSRegularExpression.
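A line-by-line parse with plain String manipulation could look roughly like this. It is only a sketch: parseEntries is a hypothetical name, and it assumes exactly one "word":"definition" entry per line with no escaped quotes or other escape sequences in the strings (nothing is unescaped).

```swift
import Foundation

// Hypothetical helper: groups `"word":"definition",` lines by word.
// Assumes one entry per line and no escaped quotes inside keys or values.
func parseEntries(_ text: String) -> [String: [String]] {
    var result: [String: [String]] = [:]
    for rawLine in text.components(separatedBy: .newlines) {
        var line = rawLine.trimmingCharacters(in: .whitespaces)
        if line.hasSuffix(",") { line.removeLast() }
        // Skip braces, blank lines, and anything that is not `"key":"value"`.
        // The first `":"` is taken as the key/value separator.
        guard line.hasPrefix("\""), line.hasSuffix("\""),
              let separator = line.range(of: "\":\""),
              separator.lowerBound > line.startIndex,
              separator.upperBound < line.endIndex else { continue }
        let word = line[..<separator.lowerBound].dropFirst()       // drop opening quote
        let definition = line[separator.upperBound...].dropLast()  // drop closing quote
        result[String(word), default: []].append(String(definition))
    }
    return result
}
```

Definitions are appended in file order, so each array preserves the order in which the duplicates appear.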

If this is more arbitrary JSON, then you may want to use a full JSON parser that can handle this, such as RNJSON. Note that this is just a hobby project to build a JSON parser that exactly follows the spec, intended as much as an example of how to write these things as a serious framework, but it can handle this JSON (as long as you get rid of that last comma, which is not legal).

import RNJSON

let keyValues = try JSONParser()
    .parse(data: json)
    .keyValues()
    .lazy
    .map({($0, [try $1.stringValue()])})

let translations = Dictionary(keyValues, uniquingKeysWith: +)

// [
//     "abandon": ["乗り捨てる (のりすてる)(a ship or vehicle)", "取り下げる (とりさげる)(e.g. a lawsuit)", "捨て去る (すてさる)(ship)", "泣し沈む (なきしずむ)oneself to grief", "遺棄する (いき)", "握りつぶす (にぎりつぶす)", "握り潰す (にぎりつぶす)", "見限る (みかぎる)", "見切り (みきり)", "見捨てる (みすてる)", "突き放す[見捨てる (つきはなす)", "放り出す (ほうりだす)", "廃棄 (はいき)", "廃棄する (はいき)", "放棄する (ほうき)"],
//     "aardvark": ["土豚 (つちぶた)"],
//     "abacus": ["算盤 (そろばん)", "十露盤 (そろばん)"],
//     "abalone": ["鮑 (あわび)"]
// ]

It's not that complicated a framework, so you could also adapt it to your own needs (making it accept that last non-standard comma, for example).

But, if possible, I'd personally just parse it line by line with simple String manipulation. That would be easiest to implement using AsyncLineSequence, which would avoid pulling all 30MB into memory before parsing.
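As a rough sketch of that streaming approach, URL.lines yields an AsyncLineSequence on macOS 12 / iOS 15 and later. The function name loadTranslations is hypothetical, and the parsing makes the same assumptions as above: one "word":"definition" entry per line and no escaped quotes.

```swift
import Foundation

// Hypothetical helper: streams the file one line at a time via URL.lines
// (an AsyncLineSequence), so the raw file text is not loaded in one piece.
// Assumes one `"word":"definition",` entry per line and no escaped quotes.
@available(macOS 12.0, iOS 15.0, *)
func loadTranslations(from url: URL) async throws -> [String: [String]] {
    var translations: [String: [String]] = [:]
    for try await rawLine in url.lines {
        var line = rawLine.trimmingCharacters(in: .whitespaces)
        if line.hasSuffix(",") { line.removeLast() }
        // Skip braces and anything that is not `"key":"value"`.
        guard line.hasPrefix("\""), line.hasSuffix("\""),
              let separator = line.range(of: "\":\""),
              separator.lowerBound > line.startIndex,
              separator.upperBound < line.endIndex else { continue }
        let word = line[..<separator.lowerBound].dropFirst()       // drop opening quote
        let definition = line[separator.upperBound...].dropLast()  // drop closing quote
        translations[String(word), default: []].append(String(definition))
    }
    return translations
}
```

In the app this would be called from an async context (for example inside a Task) with the URL of the bundled JSON file. The resulting dictionary still lives in memory; only the intermediate 30MB string is avoided.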
