Parse street addresses from text files using Node.js

Time:09-24

I am trying to solve a problem where I have to read a text file as input and create an array of objects with Node.js. The only corner case is that there may be extra whitespace.

Input:

89 Westport Ave.
Pembroke Pines, FL 33028

9529 Bayport Rd.
Eau Claire, WI 54701

9957 Wakehurst Street
Suite 42
Bonita Springs, FL 34135

8233 Franklin Drive
Neenah, WI 54956

Output:

[
  {
    address1: '89 Westport Ave.',
    address2: null,
    city: 'Pembroke Pines',
    state: 'FL',
    zip: '33028'
  },
  {
    address1: '9529 Bayport Rd.',
    address2: null,
    city: 'Eau Claire',
    state: 'WI',
    zip: '54701'
  },
  {
    address1: '9957 Wakehurst Street',
    address2: 'Suite 42',
    city: 'Bonita Springs',
    state: 'FL',
    zip: '34135'
  },
  {
    address1: '8233 Franklin Drive',
    address2: null,
    city: 'Neenah',
    state: 'WI',
    zip: '54956'
  }
]

Code I am trying:

  const parseAddressFile = path => {
  const fs = require('fs');
  const readline = require('readline');

  const data = readline.createInterface({
    input: fs.createReadStream(path)
  });
  
  
  let address = {address1: "",
                address2: "",
                city: "",
                state: "",
                zip: ""};
  const addressList = [];
  data.on('line', function (line) {
    line = line.trim();
  addressList.push(line);
//     console.log(addressList);
});

  function line2() {
    var lines = addressList.split(',');
    return lines;
  }
  
//   console.log(line2());

  data.on('close', function (line) {
  // array console.log(addressList);
//    var Ncount = 0;
   for(var x =0; x < addressList.length; x++){
//      console.log(address);
//      console.log(addressList[0]);
     address['address1'] = addressList[x];
     
     if (addressList[x].match('Suite 42')){
          address['address2'] = 'Suite 42';
        }else{
          address['address2'] = null;
        }
     
//      address['address2'] = null;
     address['city'] = addressList[line2(x)];
     
     address['state'] = addressList[x];
     
     address['zip'] = addressList[x];
      console.log(address);

  }


 });
};

module.exports = parseAddressFile;

CodePudding user response:

If you're truly, absolutely, 1,000% sure your address data doesn't materially differ from the sample data you've provided here, you can use a well-crafted RegExp to extract what you need based on the implicit pattern your data takes on:

const addressTexts = [
  `89 Westport Ave.
Pembroke Pines, FL 33028`,
  `9529 Bayport Rd.
Eau Claire, WI 54701`,
  `9957 Wakehurst Street
Suite 42
Bonita Springs, FL 34135`,
  `8233 Franklin Drive
Neenah, WI 54956`
];

const parseAddress = addressText => /(?<address1>\d+.+?)\n(?:(?<address2>.+?(?!\d{5}))\n)?(?<city>[\w\s]+?),\s(?<state>[A-Z]{2})\s(?<zip>\d{5})/g.exec(addressText).groups;

addressTexts.forEach(ele => console.log(parseAddress(ele)));

Based on what your data looks like in actuality, you may have to tweak the pattern.
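
To tie this back to the question's file-based input, here is a minimal sketch of how the pattern might be applied to a whole file. It assumes addresses are separated by one or more blank lines and that the "extra white spaces" are leading, trailing, or repeated spaces within a line. The helper names (splitIntoBlocks, parseBlock, parseAddressText) are made up for illustration, and it reads the file synchronously rather than through readline, which is a simplification on my part.

```javascript
const fs = require('fs');

// The answer's pattern written out in full, without the `g` flag
// (it is not needed for a single exec call per block).
const ADDRESS_PATTERN =
  /(?<address1>\d+.+?)\n(?:(?<address2>.+?(?!\d{5}))\n)?(?<city>[\w\s]+?),\s(?<state>[A-Z]{2})\s(?<zip>\d{5})/;

// Split raw file text into one whitespace-normalized string per address.
const splitIntoBlocks = text =>
  text
    .split(/\r?\n/)
    .map(line => line.trim().replace(/\s+/g, ' ')) // collapse stray whitespace
    .join('\n')
    .split(/\n{2,}/)                               // blank line(s) separate addresses
    .map(block => block.trim())
    .filter(block => block.length > 0);

// Parse one block; a block that does not match yields null instead of throwing.
const parseBlock = block => {
  const match = ADDRESS_PATTERN.exec(block);
  if (!match) return null;
  // An unmatched optional group is undefined, so default address2 to null.
  const { address1, address2 = null, city, state, zip } = match.groups;
  return { address1, address2, city, state, zip };
};

const parseAddressText = text => splitIntoBlocks(text).map(parseBlock);

// Same export shape as the question's module: takes a path, returns the array.
const parseAddressFile = path => parseAddressText(fs.readFileSync(path, 'utf8'));

module.exports = parseAddressFile;
```

Keeping the text-level function separate from the file read makes it easy to try the parser on a string first. A null entry in the result marks a block the pattern could not handle.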

CodePudding user response:

I already provided a rudimentary answer using RegExp, which will work well enough if the data in question is as truly clean & consistent as the OP has assured us it is. However, my advice on writing a parser yourself for this kind of data is: don't.

There is an absolutely ridiculous number of edge cases you must consider to deal effectively with an otherwise unstructured string representing a postal address. In the United States alone, there is a nearly infinite number of "legal" ways to express the exact same address. Fortunately, we have services and open-source options to increase the effectiveness of code that must make use of this unstructured address data.
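
To make the edge-case point a little more concrete, here is a small sketch that runs the RegExp from my earlier answer against three inputs that differ only slightly from the sample data. The inputs are hypothetical examples I wrote for illustration, and tryParse is just a helper name for this sketch.

```javascript
// The pattern from the earlier answer, written out in full, without the `g` flag.
const pattern =
  /(?<address1>\d+.+?)\n(?:(?<address2>.+?(?!\d{5}))\n)?(?<city>[\w\s]+?),\s(?<state>[A-Z]{2})\s(?<zip>\d{5})/;

// Return the named groups, or null when the pattern does not match at all.
const tryParse = text => pattern.exec(text)?.groups ?? null;

// Hypothetical inputs, each a plausible way of writing a US address.
const poBox = tryParse('P.O. Box 123\nAustin, TX 78701');
const zipPlus4 = tryParse('1600 Pennsylvania Ave NW\nWashington, DC 20500-0003');
const dottedCity = tryParse('1 Main St.\nSt. Louis, MO 63101');

console.log(poBox.address1); // '123': the "P.O. Box" prefix is silently dropped
console.log(zipPlus4.zip);   // '20500': the +4 extension is silently dropped
console.log(dottedCity);     // null: "." is not allowed in the city group
```

Two of the three inputs "succeed" with wrong data, which is arguably worse than the outright failure on the third.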


If you trust the data source and just need to do more rudimentary "segmentation" of the individual components of the address

Use a library like openvenues/libpostal. It is a machine learning model trained to parse addresses from unstructured strings. They've fed the model "over 1 billion addresses in every inhabited country on Earth", so the international coverage is great as well.

There are even official Node bindings for libpostal which are relatively easy to set up. Once you've followed the Installation guide, just pass your address strings into the model with a single line of code:

var postal = require('node-postal');
postal.parser.parse_address('Barboncino 781 Franklin Ave, Crown Heights, Brooklyn, NY 11238');
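
As far as I recall from the node-postal README, the parser returns an array of labeled pieces, each with a component and a value property, rather than a single object. Assuming that shape, a small helper like the one below could fold the pieces into one record. The toRecord name is made up, and the sample array is a hand-written stand-in in that assumed shape, not output captured from a real run, so check it against what your installed version actually returns.

```javascript
// Fold an array of { component, value } pairs into a single object.
// Repeated labels are joined with a space instead of overwriting each other.
const toRecord = components =>
  components.reduce((record, { component, value }) => {
    const seen = Object.prototype.hasOwnProperty.call(record, component);
    record[component] = seen ? `${record[component]} ${value}` : value;
    return record;
  }, {});

// Hand-written stand-in for parser output (an assumption, not a captured result).
const sample = [
  { component: 'house_number', value: '781' },
  { component: 'road', value: 'franklin ave' },
  { component: 'city', value: 'brooklyn' },
  { component: 'state', value: 'ny' },
  { component: 'postcode', value: '11238' }
];

console.log(toRecord(sample));
```

From there, mapping the labels onto the address1/city/state/zip keys from the question is a matter of picking which labels you care about.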

If you care about the validity of the address itself

If you need to ensure that, at some level, there will be an address in real life that corresponds with the data, use a geocoder service. Companies like Google, Microsoft, and other smaller players license their map data, usually for a fee. This includes the ability to provide an unstructured string and ensure that your data correlates with a real-life street address that they've collected previously.
