Regex to parse Azure Data Lake Storage Gen2 URI for production and testing with Azurite

Time:02-24

In my Java application I am using Azure Data Lake Storage Gen2 for storage (ABFS). In the class that handles the requests to the filesystem, I get a file path as an input and then use some regex to extract Azure connection info from it.

The Azure Data Lake Storage Gen2 URI is in the following format:

abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>

I use the following regex abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?) to parse a given file path to extract:

  • fileSystem
  • accountName
  • accountSuffix
  • relativePath (path + file_name)

Below is some test Java code, with comments stating the value of each variable after matching.

private void parsePath(String path) {
    //path = abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv
    Pattern azurePathPattern = Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");
    Matcher matcher = azurePathPattern.matcher(path);
    if (matcher.find()) {
        String fileSystem = matcher.group(1); //storage
        String accountName = matcher.group(2); //myaccount
        String accountSuffix = matcher.group(3); //.dfs.core.windows.net
        //relativePath is <path>/<file_name>
        String relativePath = matcher.group(4); //selim/test.csv
    }
}

The problem arose when I decided to use Azurite, an Azure Storage API-compatible server (emulator) that allows me to run unit tests against it instead of against an actual Azure server, as recommended in the Microsoft documentation.

Azurite uses a different file URI format than Azure, which makes the regex above invalid for testing purposes. The Azurite file URI has the following format:

abfs[s]://<file_system>@<local_ip>:<local_port>/<account_name>/<path>/<file_name>

Azurite's default account_name is devstoreaccount1, so here is an example path for a file on Azurite:

abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv

If this path is parsed by the regex above, the output is as follows, which causes incorrect API calls to the Azurite server:

  • fileSystem: storage (correct)
  • accountName: 127 (incorrect, should be: devstoreaccount1)
  • accountSuffix: .0.0.1:10000 (incorrect, should be empty string)
  • relativePath: devstoreaccount1/selim/test.csv (incorrect, should be selim/test.csv)
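
The mismatch listed above can be reproduced with a short sketch. It assumes the question's original pattern with its + quantifiers in place and an Azurite host of 127.0.0.1:10000; the class name OriginalRegexCheck and the parse helper are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalRegexCheck {
    // The question's original pattern, written as a Java string literal.
    static final Pattern AZURE_ONLY =
        Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");

    // Returns {fileSystem, accountName, accountSuffix, relativePath},
    // or null when the pattern finds no match.
    static String[] parse(String path) {
        Matcher m = AZURE_ONLY.matcher(path);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Assumed Azurite example path (default account devstoreaccount1).
        String azurite = "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv";
        String[] parts = parse(azurite);
        System.out.println("fileSystem:    " + parts[0]);
        System.out.println("accountName:   " + parts[1]); // first octet of the IP, not the account
        System.out.println("accountSuffix: " + parts[2]); // rest of the IP plus the port
        System.out.println("relativePath:  " + parts[3]); // still contains the account name
    }
}
```

The account name is captured wrongly because [^\\.]+ stops at the first dot of the IP address, so everything after it is treated as the account suffix.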

Is it possible to have a single regex that handles both URIs, or should I use two regexes to solve this issue?

Answer:

Solution 1

You can use a single pattern for this, but you will need to check which group matched in the code to determine where the necessary details are captured.

The regex will look like

abfss?://(?:([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+)|([^/]+)@([^.]+)(\.[^/]+))(?:/(.+))?

The ([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+) alternative captures the file system, the IP-address-like part (with port number) after @, and the account name after the slash.

The Java code will look like

import java.util.*;
import java.util.regex.*;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
        Pattern pattern = Pattern.compile("abfss?://(?:([^@/]*)@(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/([^/]+)|([^/]+)@([^.]+)(\\.[^/]+))(?:/(.+))?");
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv"
        };
        for (String s: inputs) {
            Matcher matcher = pattern.matcher(s);
            if (matcher.find()){
                if (matcher.group(5) != null) { // If original URL is found
                    String fileSystem = matcher.group(4); //storage
                    String accountName = matcher.group(5); //myaccount
                    String accountSuffix = matcher.group(6); //.dfs.core.windows.net
                    String relativePath = matcher.group(7); //selim/test.csv
                    System.out.println(s + ":\nfileSystem: " + fileSystem + "\naccountName: " + accountName + "\naccountSuffix: '" + accountSuffix + "'\nrelativePath:" + relativePath + "\n-----");
                } else { // we have an Azurite URL
                    String fileSystem = matcher.group(1); //storage
                    String accountName = matcher.group(3); //devstoreaccount1
                    String accountSuffix = ""; // empty (or do you need matcher.group(2) to get "127.0.0.1:10000"?)
                    String relativePath = matcher.group(7); //selim/test.csv
                    System.out.println(s + ":\nfileSystem: " + fileSystem + "\naccountName: " + accountName + "\naccountSuffix: '" + accountSuffix + "'\nrelativePath:" + relativePath + "\n-----");
                }
            }
        }
    }
}

Output:

abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv:
fileSystem: storage
accountName: myaccount
accountSuffix: '.dfs.core.windows.net'
relativePath:selim/test.csv
-----
abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv:
fileSystem: storage
accountName: devstoreaccount1
accountSuffix: ''
relativePath:selim/test.csv
-----
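
As a variation on the single-pattern approach, the group-index bookkeeping can be replaced with named groups. This is only a sketch under some assumptions: Java does not allow two groups in one pattern to share a name, so the Azurite branch gets its own names (azFs, azHost, azAccount), and these names, the class NamedGroupAbfsParser, and the parse helper are all hypothetical, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupAbfsParser {
    // Same alternation as the single-pattern solution, with named groups.
    static final Pattern ABFS = Pattern.compile(
        "abfss?://(?:"
        + "(?<azFs>[^@/]*)@(?<azHost>\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/(?<azAccount>[^/]+)"
        + "|"
        + "(?<fs>[^/]+)@(?<account>[^.]+)(?<suffix>\\.[^/]+)"
        + ")(?:/(?<path>.+))?");

    // Returns the four fields keyed by name, or null when the URI does not match.
    static Map<String, String> parse(String uri) {
        Matcher m = ABFS.matcher(uri);
        if (!m.find()) {
            return null;
        }
        // A group in the branch that did not participate in the match is null.
        boolean azurite = m.group("azAccount") != null;
        Map<String, String> res = new LinkedHashMap<>();
        res.put("fileSystem", azurite ? m.group("azFs") : m.group("fs"));
        res.put("accountName", azurite ? m.group("azAccount") : m.group("account"));
        res.put("accountSuffix", azurite ? "" : m.group("suffix"));
        res.put("relativePath", m.group("path")); // null when the URI has no path
        return res;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + parse(s));
        }
    }
}
```

The behaviour is meant to be the same as with numbered groups; the names only make it harder to mix up indices if the pattern is edited later.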

Solution 2

You can use two different regular expressions and if the first one does not find a match, the second will be tried. The first one:

abfss?://([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+)(?:/(.+))?

The second one:

abfss?://([^/]+)@([^.]+)(\.[^/]+)(?:/(.+))?

As the second pattern also matches URLs of the first type, you need to make sure you run them in this fixed order.

See the Java demo:

import java.util.*;
import java.util.regex.*;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
        Pattern pattern_azurite = Pattern.compile("abfss?://([^@/]*)@(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/([^/]+)(?:/(.+))?");
        Pattern pattern_original = Pattern.compile("abfss?://([^/]+)@([^.]+)(\\.[^/]+)(?:/(.+))?");
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv",
            "http://www.blahblah.blah"
        };
        for (String s: inputs) {
            Map<String, String> result = null;
            Matcher matcher_azurite = pattern_azurite.matcher(s);
            if (matcher_azurite.find()){
                result = parseMatchResult(matcher_azurite, new int[] {1, 3, -1, 4});
            } else {
                Matcher matcher_original = pattern_original.matcher(s);
                if (matcher_original.find()){
                    result = parseMatchResult(matcher_original, new int[] {1, 2, 3, 4});
                }
            }
            if (result != null) {                        // Now print
                for (String key : result.keySet()) {
                    System.out.println("'" + key + "': '" + result.get(key) + "'");
                }
                System.out.println("----------------");
            } else {
                System.out.println("No match!");
            }
            
        }
    }
    public static Map<String, String> parseMatchResult(Matcher m, int[] indices) {
        Map<String, String> res = new HashMap<String, String>();
        res.put("fileSystem", m.group(indices[0]));
        res.put("accountName", m.group(indices[1]));
        res.put("accountSuffix", indices[2] > -1 ? m.group(indices[2]) : "");
        res.put("relativePath", m.group(indices[3]));
        return res;
    }
}

Output:

'fileSystem': 'storage'
'accountSuffix': '.dfs.core.windows.net'
'accountName': 'myaccount'
'relativePath': 'selim/test.csv'
----------------
'fileSystem': 'storage'
'accountSuffix': ''
'accountName': 'devstoreaccount1'
'relativePath': 'selim/test.csv'
----------------
No match!
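
The ordering requirement in Solution 2 can be checked with a small sketch. It assumes the two patterns with their + quantifiers in place and an Azurite host of 127.0.0.1:10000; the class PatternOrderCheck and the helper accountName are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternOrderCheck {
    static final Pattern AZURITE = Pattern.compile(
        "abfss?://([^@/]*)@(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/([^/]+)(?:/(.+))?");
    static final Pattern AZURE = Pattern.compile(
        "abfss?://([^/]+)@([^.]+)(\\.[^/]+)(?:/(.+))?");

    // Returns the text of the given group, or null when the pattern finds no match.
    static String accountName(Pattern p, int group, String uri) {
        Matcher m = p.matcher(uri);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String azure = "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv";
        String azurite = "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv";

        // The Azurite pattern requires an IP:port host, so it rejects the Azure URI.
        System.out.println("azurite pattern on azure URI:   " + accountName(AZURITE, 3, azure));
        // The Azure pattern accepts the Azurite URI but captures the wrong account name.
        System.out.println("azure pattern on azurite URI:   " + accountName(AZURE, 2, azurite));
        // Tried first, the Azurite pattern captures the intended account name.
        System.out.println("azurite pattern on azurite URI: " + accountName(AZURITE, 3, azurite));
    }
}
```

Since only the Azure pattern is permissive enough to match both forms, trying the Azurite pattern first is what keeps the two cases apart.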