In my Java application I am using Azure Data Lake Storage Gen2 for storage (ABFS). In the class that handles the requests to the filesystem, I get a file path as an input and then use some regex to extract Azure connection info from it.
The Azure Data Lake Storage Gen2 URI is in the following format:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
I use the following regex abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)
to parse a given file path to extract:
- fileSystem
- accountName
- accountSuffix
- relativePath (path + file_name)
Below is some test Java code, with comments stating the value of each variable after matching.
private void parsePath(String path) {
    // path = abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv
    Pattern azurePathPattern = Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");
    Matcher matcher = azurePathPattern.matcher(path);
    if (matcher.find()) {
        String fileSystem = matcher.group(1); // storage
        String accountName = matcher.group(2); // myaccount
        String accountSuffix = matcher.group(3); // .dfs.core.windows.net
        // relativePath is <path>/<file_name>
        String relativePath = matcher.group(4); // selim/test.csv
    }
}
The problem appeared when I decided to use Azurite, an Azure Storage API-compatible server (emulator) that allows me to run unit tests locally instead of against an actual Azure server, as recommended in the Microsoft documentation.
Azurite uses a different file URI format than Azure, which makes the regex above invalid for testing purposes. An Azurite file URI has the following format:
abfs[s]://<file_system>@<local_ip>:<local_port>/<account_name>/<path>/<file_name>
The default Azurite account_name is devstoreaccount1, so here is an example path for a file on Azurite:
abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv
If parsed by the regex above, this produces the following output, causing incorrect API calls to the Azurite server:
- fileSystem: storage (correct)
- accountName: 127 (incorrect, should be: devstoreaccount1)
- accountSuffix: .0.0.1:10000 (incorrect, should be empty string)
- relativePath: devstoreaccount1/selim/test.csv (incorrect, should be selim/test.csv)
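The mismatch listed above can be reproduced with a small standalone program. This is only a sketch: the class name OriginalRegexDemo and the parse helper are made up for illustration, and the pattern is the one from the question, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalRegexDemo {
    // The pattern from the question, unchanged.
    static final Pattern AZURE_PATH =
            Pattern.compile("abfss?://([^/]+)@([^\\.]+)(\\.[^/]+)/?((.+)?)");

    // Returns {fileSystem, accountName, accountSuffix, relativePath},
    // or null when the pattern does not match at all.
    static String[] parse(String path) {
        Matcher m = AZURE_PATH.matcher(path);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String azurite = "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv";
        String[] parts = parse(azurite);
        // The first dot of the IP address ends the "account name" group,
        // so the rest of the address and the port land in the suffix group.
        System.out.println("fileSystem: " + parts[0]);
        System.out.println("accountName: " + parts[1]);
        System.out.println("accountSuffix: " + parts[2]);
        System.out.println("relativePath: " + parts[3]);
    }
}
```

The account name group stops at the first dot, which is why an IP-based authority is split after 127.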
Is it possible to have one regex that can handle both URIs, or do I need two regexes to solve this issue?
CodePudding user response:
Solution 1
You can use a single pattern for this, but you will need to check which group matched in the code to determine where the necessary details are captured.
The regex will look like
abfss?://(?:([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+)|([^/]+)@([^.]+)(\.[^/]+))(?:/(.+))?
The ([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+) alternative captures the file system, the IP-address-like part (with the port number) after @, and the account name after the slash.
The Java code will look like
import java.util.*;
import java.util.regex.*;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
        Pattern pattern = Pattern.compile("abfss?://(?:([^@/]*)@(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/([^/]+)|([^/]+)@([^.]+)(\\.[^/]+))(?:/(.+))?");
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv"
        };
        for (String s: inputs) {
            Matcher matcher = pattern.matcher(s);
            if (matcher.find()){
                if (matcher.group(5) != null) { // If original URL is found
                    String fileSystem = matcher.group(4); // storage
                    String accountName = matcher.group(5); // myaccount
                    String accountSuffix = matcher.group(6); // .dfs.core.windows.net
                    String relativePath = matcher.group(7); // selim/test.csv
                    System.out.println(s + ":\nfileSystem: " + fileSystem + "\naccountName: " + accountName + "\naccountSuffix: '" + accountSuffix + "'\nrelativePath:" + relativePath + "\n-----");
                } else { // we have an Azurite URL
                    String fileSystem = matcher.group(1); // storage
                    String accountName = matcher.group(3); // devstoreaccount1
                    String accountSuffix = ""; // empty (or do you need matcher.group(2) to get "127.0.0.1:10000"?)
                    String relativePath = matcher.group(7); // selim/test.csv
                    System.out.println(s + ":\nfileSystem: " + fileSystem + "\naccountName: " + accountName + "\naccountSuffix: '" + accountSuffix + "'\nrelativePath:" + relativePath + "\n-----");
                }
            }
        }
    }
}
Output:
abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv:
fileSystem: storage
accountName: myaccount
accountSuffix: '.dfs.core.windows.net'
relativePath:selim/test.csv
-----
abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv:
fileSystem: storage
accountName: devstoreaccount1
accountSuffix: ''
relativePath:selim/test.csv
-----
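If tracking group numbers by hand feels fragile, the same single pattern can be written with named groups. The following is a minimal sketch, not part of the original answer: the class name NamedGroupParser, the parse helper, and all group names are made up for illustration. Java does not allow the same group name twice in one pattern, so each alternative gets its own names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupParser {
    // Same structure as the single pattern above, with named groups.
    static final Pattern ABFS = Pattern.compile(
            "abfss?://(?:"
            // Azurite: <file_system>@<ip>:<port>/<account_name>
            + "(?<azuriteFs>[^@/]*)@(?<azuriteHost>\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/(?<azuriteAccount>[^/]+)"
            // Azure: <file_system>@<account_name><suffix>
            + "|(?<fs>[^/]+)@(?<account>[^.]+)(?<suffix>\\.[^/]+)"
            + ")(?:/(?<path>.+))?");

    // Returns {fileSystem, accountName, accountSuffix, relativePath},
    // or null when the input matches neither URI form.
    static String[] parse(String uri) {
        Matcher m = ABFS.matcher(uri);
        if (!m.find()) {
            return null;
        }
        // A group that did not take part in the match returns null.
        boolean isAzure = m.group("account") != null;
        String fileSystem = isAzure ? m.group("fs") : m.group("azuriteFs");
        String accountName = isAzure ? m.group("account") : m.group("azuriteAccount");
        String accountSuffix = isAzure ? m.group("suffix") : "";
        return new String[] {fileSystem, accountName, accountSuffix, m.group("path")};
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv"
        };
        for (String s : inputs) {
            System.out.println(String.join(" | ", parse(s)));
        }
    }
}
```

Note that relativePath is null when the URI has no path after the account part, so callers may want to handle that case.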
Solution 2
You can use two different regular expressions and if the first one does not find a match, the second will be tried. The first one:
abfss?://([^@/]*)@(\d{1,3}(?:\.\d{1,3}){3}:\d+)/([^/]+)(?:/(.+))?
The second one:
abfss?://([^/]+)@([^.]+)(\.[^/]+)(?:/(.+))?
As the second pattern also matches URLs of the first type, make sure you always try them in this fixed order.
See the Java demo:
import java.util.*;
import java.util.regex.*;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
        Pattern pattern_azurite = Pattern.compile("abfss?://([^@/]*)@(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)/([^/]+)(?:/(.+))?");
        Pattern pattern_original = Pattern.compile("abfss?://([^/]+)@([^.]+)(\\.[^/]+)(?:/(.+))?");
        String[] inputs = {
            "abfs://storage@myaccount.dfs.core.windows.net/selim/test.csv",
            "abfs://storage@127.0.0.1:10000/devstoreaccount1/selim/test.csv",
            "http://www.blahblah.blah"
        };
        for (String s: inputs) {
            Map<String, String> result = null;
            Matcher matcher_azurite = pattern_azurite.matcher(s);
            if (matcher_azurite.find()){
                // Azurite has no account suffix, hence the -1 index
                result = parseMatchResult(matcher_azurite, new int[] {1, 3, -1, 4});
            } else {
                Matcher matcher_original = pattern_original.matcher(s);
                if (matcher_original.find()){
                    result = parseMatchResult(matcher_original, new int[] {1, 2, 3, 4});
                }
            }
            if (result != null) { // Now print
                for (String key : result.keySet()) {
                    System.out.println("'" + key + "': '" + result.get(key) + "'");
                }
                System.out.println("----------------");
            } else {
                System.out.println("No match!");
            }
        }
    }

    // indices: group numbers for fileSystem, accountName, accountSuffix (-1 = none), relativePath
    public static Map<String, String> parseMatchResult(Matcher m, int[] indices) {
        Map<String, String> res = new HashMap<String, String>();
        res.put("fileSystem", m.group(indices[0]));
        res.put("accountName", m.group(indices[1]));
        res.put("accountSuffix", indices[2] > -1 ? m.group(indices[2]) : "");
        res.put("relativePath", m.group(indices[3]));
        return res;
    }
}
Output:
'fileSystem': 'storage'
'accountSuffix': '.dfs.core.windows.net'
'accountName': 'myaccount'
'relativePath': 'selim/test.csv'
----------------
'fileSystem': 'storage'
'accountSuffix': ''
'accountName': 'devstoreaccount1'
'relativePath': 'selim/test.csv'
----------------
No match!