Java - extract from line based on regex

Time:06-06

Small question regarding a Java job to extract information out of lines from a file please.

Setup: I have a file in which each line looks like this:

bla,bla42bla()bla=bla blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla

The file contains many such lines (as described above). In each line, there are two pieces of information I am interested in: the primaryKey and the country.

In my example, ZAPDBHV7120D41A and USA

Each line of the file is guaranteed to contain the primaryKey exactly once and the country exactly once, separated by a comma. The pair can appear anywhere in the line (at the start, in the middle, at the end, etc.).

The primary key is a combination of uppercase letters [A, B, C, ... Y, Z] and digits [0, 1, 2, ... 9]. It has no predefined length.

The primary key always appears in the form primaryKey="({primaryKey},{country}, meaning that the actual primaryKey is found after the string primaryKey-equal-quote-open parenthesis, and before a comma, the three-letter country, and another comma.

I would like to write a program that extracts all the primary keys, as well as all the countries, from the file.

Input:

bla,bla42bla()bla=bla blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
bla  blabla()bla=bla blablaprimaryKey="(AA45555DBMW711DD4100,ARG,bla
[...]

Result:

The primaryKey is ZAPDBHV7120D41A
The country is USA

The primaryKey is AA45555DBMW711DD4100
The country is ARG

Therefore, I tried the following:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;

public class RegexExtract {

    public static void main(String[] args) throws Exception {
        final String csvFile = "my_file.txt";
        try (final BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            String line;
            while ((line = br.readLine()) != null) {
                Pattern.matches("", line); // extract primaryKey and country based on regex
                String primaryKey = ""; // extract the primary from above
                String country = ""; // extract the country from above
                System.out.println("The primaryKey is " + primaryKey);
                System.out.println("The country is " + country);
            }
        }
    }
}

But I am having a hard time constructing the regular expression needed to match and extract.

May I ask what the correct code is to extract this information from each line?

Thank you

Answer:

Explanations after the code.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {

    public static void main(String[] args) {
        Path path = Paths.get("my_file.txt");
        try (BufferedReader br = Files.newBufferedReader(path)) {
            Pattern pattern = Pattern.compile("primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)");
            String line = br.readLine();
            while (line != null) {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    String primaryKey = matcher.group(1);
                    String country = matcher.group(2);
                    System.out.println("The primaryKey is " + primaryKey);
                    System.out.println("The country is " + country);
                }
                line = br.readLine();
            }
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

Running the above code produces the following output (using the two sample lines in your question).

The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG

The regular expression looks for the following [literal] string

primaryKey="(

The double quote is escaped since it is within a string literal.
The opening parenthesis is escaped because it is a regex metacharacter. The backslash must be doubled because \( is not a valid escape sequence in a Java string literal; \\ in the literal produces the single backslash that the regex engine needs.
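To make the two layers of escaping concrete, here is a minimal sketch. The class name EscapeDemo and its helper methods are made up for illustration. It prints the regex text that the Java literal actually produces, and shows Pattern.quote as one possible alternative to escaping the literal prefix by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the answer's program.
class EscapeDemo {

    // Java string literal -> regex text. The literal below holds: primaryKey="\(
    static final String PREFIX_REGEX = "primaryKey=\"\\(";

    // True if the literal text primaryKey="( occurs anywhere in the line.
    static boolean hasPrefix(String line) {
        return Pattern.compile(PREFIX_REGEX).matcher(line).find();
    }

    // Alternative: Pattern.quote treats the whole string as literal text,
    // so the parenthesis needs no manual regex escaping.
    static boolean hasPrefixQuoted(String line) {
        return Pattern.compile(Pattern.quote("primaryKey=\"(")).matcher(line).find();
    }

    public static void main(String[] args) {
        // Shows what the regex engine receives after the compiler
        // has processed the string literal.
        System.out.println(PREFIX_REGEX);
        System.out.println(hasPrefix("bla primaryKey=\"(ABC123,USA,bla"));
        System.out.println(hasPrefixQuoted("bla primaryKey=\"(ABC123,USA,bla"));
    }
}
```

Both helpers should accept the same lines. Which one reads better is a matter of taste, and Pattern.quote only helps for the fixed prefix, not for the capturing groups that follow it.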

Then the regular expression groups together the string of consecutive capital letters and digits that follow the previous literal up to (but not including) the comma.

Then there is a second group of capital letters up to the next comma.
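If the country is always exactly three capital letters followed by a comma, as the question states, the pattern could be tightened. The following is only a sketch under that assumption. The class name NamedGroupExtract, the extract helper, and the group names key and country are chosen here for illustration. It uses named groups so the groups can be read by name rather than by index.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the extraction logic.
class NamedGroupExtract {

    // Assumption: the country is exactly three capital letters and is
    // followed by a comma, as described in the question.
    private static final Pattern KEY_AND_COUNTRY = Pattern.compile(
            "primaryKey=\"\\((?<key>[A-Z0-9]+),(?<country>[A-Z]{3}),");

    // Returns {primaryKey, country} if the line contains the pattern,
    // otherwise an empty Optional.
    static Optional<String[]> extract(String line) {
        Matcher matcher = KEY_AND_COUNTRY.matcher(line);
        if (matcher.find()) {
            return Optional.of(new String[] {
                matcher.group("key"), matcher.group("country") });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "bla,bla42bla()bla=bla blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla",
            "bla  blabla()bla=bla blablaprimaryKey=\"(AA45555DBMW711DD4100,ARG,bla",
            "a line without the marker"
        };
        for (String sample : samples) {
            // Lines that do not contain the pattern are skipped silently.
            extract(sample).ifPresent(pair -> {
                System.out.println("The primaryKey is " + pair[0]);
                System.out.println("The country is " + pair[1]);
            });
        }
    }
}
```

Note that find() searches for the pattern anywhere in the line, whereas Pattern.matches only succeeds when the entire line matches the expression. Since the primaryKey can appear at any position in the line, find() is the better fit here.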

Refer to the Regular Expressions lesson in Oracle's Java tutorials.
