Filter from large log file

Time:11-17

For a large log file with Windows (CRLF) line endings, I would like to do the following without modifying the file itself:

  1. Remove all CRLF characters
  2. Insert a blank line between the "CLG..." and "TRC..." lines at the end of the log file
  3. After reading the results in paragraph mode, print the paragraph if a particular string exists

The code below does not work.

use strict;
use warnings;

my $ID = "[email protected]";
my $SDP;

open (LOG, "file.log") || die $!;

my $line;

while(<LOG>) {
        $line .= $_;
        $line =~s/\r//g;
}

local $/ = '';

while (<>) {
    if ( /Call-ID:\s+(.+)/ and $ID ) {
        $SDP = 1;
        print;        
        next;
    }

    print if $SDP && /\brtpmap\b/;

    $SDP = 0;
}

close(LOG);

Jan 28 11:39:37.525 CET: //1393628/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Received:^M 
SIP/2.0 200 OK^M
Via: SIP/2.0/UDP 10.218.16.2:5060;branch=z9hG4bKB22001ED5^M
From: "Frankeerapparaat Secretariaat" <sip:[email protected]>;tag=E7E0EF64-192F^M
To: <sip:[email protected]>;tag=25079324~19cc0abf-61d9-407f-a138-96eaffee1467-27521338^M
Date: Mon, 28 Jan 2013 10:39:32 GMT^M
Call-ID: [email protected]^M
CSeq: 102 INVITE^M
Allow: INVITE, OPTIONS, INFO, BYE, CANCEL, ACK, PRACK, UPDATE, REFER, SUBSCRIBE, NOTIFY^M
Allow-Events: presence^M
Supported: replaces^M
Supported: X-cisco-srtp-fallback^M
Supported: Geolocation^M
Session-Expires:  1800;refresher=uas^M
Require:  timer^M
P-Preferred-Identity: <sip:[email protected]>^M
Remote-Party-ID: <sip:[email protected]>;party=called;screen=no;privacy=off^M
Contact: <sip:[email protected]:5060>^M
Content-Type: application/sdp^M
Content-Length: 209^M
^M
v=0^M
o=CiscoSystemsCCM-SIP 2000 1 IN IP4 10.210.2.49^M
s=SIP Call^M
c=IN IP4 10.210.2.1^M
t=0 0^M
m=audio 16844 RTP/AVP 8 101^M
a=rtpmap:8 PCMA/8000^M
a=ptime:20^M
a=rtpmap:101 telephone-event/8000^M
a=fmtp:101 0-15^M
^M
Jan 28 11:39:37.529 CET: //1393628/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Sent:^M
ACK sip:[email protected]:5060 SIP/2.0^M
Via: SIP/2.0/UDP 10.218.16.2:5060;branch=z9hG4bKB2247150A^M
From: "Frankeerapparaat Secretariaat" <sip:[email protected]>;tag=E7E0EF64-192F^M
To: <sip:[email protected]>;tag=25079324~19cc0abf-61d9-407f-a138-96eaffee1467-27521338^M
Date: Mon, 28 Jan 2013 10:39:36 GMT^M
Call-ID: [email protected]^M
Max-Forwards: 70^M
CSeq: 102 ACK^M
Authorization: Digest username="Genk_AC_1",realm="infraxnet.be",uri="sip:[email protected]:5060",response="9546733290a96d1470cfe29a7500c488",nonce="5V/Jt8FHd5I8uaoahshiaUud8O6UujJJ",algorithm=MD5^M
Allow-Events: telephone-event^M
Content-Length: 0^M
^M
^M
Jan 28 11:39:37.529 CET: //1393627/D5CC0586A87B/SIP/Msg/ccsipDisplayMsg:^M
Sent:^M
SIP/2.0 200 OK^M
Via: SIP/2.0/UDP 192.168.8.11:5060;branch=z9hG4bK24ecaaaa6dbd3^M
From: "Frankeerapparaat Secretariaat" <sip:[email protected]>;tag=e206cc93-1791-457a-aaac-1541296cf17c-29093746^M
To: <sip:[email protected]>;tag=E7E0F8A4-EA3^M
Date: Mon, 28 Jan 2013 10:39:32 GMT^M
Call-ID: [email protected]^M
CSeq: 101 INVITE^M
Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, PRACK, UPDATE, REFER, SUBSCRIBE, NOTIFY, INFO, REGISTER^M
Allow-Events: telephone-event^M
Remote-Party-ID: <sip:[email protected]>;party=called;screen=no;privacy=off^M
Contact: <sip:[email protected]:5060>^M
Supported: replaces^M
Supported: sdp-anat^M
Server: Cisco-SIPGateway/IOS-15.3.1.T^M
Session-Expires:  1800;refresher=uas^M
Require: timer^M
Supported: timer^M
Content-Type: application/sdp^M
Content-Disposition: session;handling=required^M
Content-Length: 247^M
^M
v=0^M
o=CiscoSystemsSIP-GW-UserAgent 7276 9141 IN IP4 192.168.8.28^M
s=SIP Call^M
c=IN IP4 192.168.8.28^M
t=0 0^M
m=audio 30134 RTP/AVP 8 101^M
c=IN IP4 192.168.8.28^M
a=rtpmap:8 PCMA/8000^M
a=rtpmap:101 telephone-event/8000^M
a=fmtp:101 0-15^M
a=ptime:20^M
^M
CLG(2022-11-07 00:09:06.444)| Call(Terminate) | 302A330B040C73070A021806021C0200 | ^M
TRC(2022-11-15 00:00:38.012)| SIP( OUT : Response ) Trying( 100 INVITE ) | 2 |  | 0 | 332C30050A0F750A00011A06021C0200 | SIP/2.0 100 Trying^M

CodePudding user response:

There are many things getting in your way here. I think I'll get close to what you are trying to do, but I have to make some guesses.

First, the bare @ in $ID is interpolating and you are missing the @10 in your string when it should have been a literal. You aren't getting a warning perhaps because that identifier would be a Perl special variable rather than a user-defined one.
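As a small illustration of the quoting point, here is a sketch with a made-up Call-ID (the real value is not reproduced here, so `ABC123@10.0.0.1` is purely hypothetical). Single quotes, or an escaped `\@` inside double quotes, keep the `@` literal:

```perl
use strict;
use warnings;

# Hypothetical Call-ID, made up for this example.
my $single  = 'ABC123@10.0.0.1';    # single quotes: nothing is interpolated
my $escaped = "ABC123\@10.0.0.1";   # double quotes with the @ escaped

print "single:  $single\n";
print "escaped: $escaped\n";
print "both forms hold the same string\n" if $single eq $escaped;
```

Either form should be safe to compare against the header value later.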

Second, you have some weird filehandling there.

You're building up a modified log file into a single, big string in $line. You say that you have large files, and that means different things to different people. But some people work in contexts where "large" is tens or hundreds of gigs. Don't do that.

Third, you don't do anything with $line after you build it up. I think that you are expecting to read it again with <>.

I'd approach this a bit differently. I don't care so much about the line-endings right off the bat. If I'm going through millions of records, I don't want to spend time converting every line when I'm going to ignore most of them. I can convert that when I have stuff to output. That also depends a bit on the proportion of hits you expect. If you are printing almost every record, it doesn't matter as much as if you are printing 1% of them.

To start, I know that the file format looks almost like the email mbox format. That first line with the date ruins it all because it doesn't have a fixed string that you can use to see the start of the record like the mbox envelope does. This also means that since the header and body within a record are separated by CRLFCRLF and the records themselves are also separated by CRLFCRLF, it's a bit tricky to get a complete record in paragraph mode.

So, let's read chunks separated by CRLFCRLF. The first chunk should be the header and the second chunk should be the body. There's a chance to get out of step here and there are some things you can do to recover from that, but I'll skip that here. Basically, inspect the chunk and see if it is what you expected (begins with date, etc). If you are curious about that sort of thing, the design of UTF-8 is interesting since it started with the idea that things can get garbled but you can get back on track.

Here's what we have so far. Single-quote $ID to get the true value, and set the input record separator ($/) to CRLFCRLF before reading from ARGV (the filehandle behind the empty <>). Strictly, $/ is a global variable rather than a per-filehandle one, so the select(ARGV) dance in the code isn't required for it (it is for per-handle variables such as $|). It's harmless, so let's leave it at that. Then, the meat of my program is a while loop:

use v5.10;
use strict;
use warnings;

my $ID = '[email protected]';

my $old = select(ARGV);
$/ = "\x0D\x0A" x 2;
select($old);

while( <> ) {
    ...
    }
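To see what reading with $/ set to CRLFCRLF produces, here is a minimal sketch that reads from an in-memory string instead of a real log. The sample data is made up, and the lexical handle `$fh` stands in for ARGV:

```perl
use strict;
use warnings;

# Made-up sample: two chunks, each terminated by CRLFCRLF.
my $data = "Line one\r\nLine two\r\n\r\n" . "v=0\r\ns=SIP Call\r\n\r\n";

# Read from the string as if it were a file.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @chunks;
{
    local $/ = "\x0D\x0A" x 2;    # record separator: CRLFCRLF
    while ( my $chunk = <$fh> ) {
        chomp $chunk;             # chomp strips the current $/ value
        push @chunks, $chunk;
    }
}
close $fh;

printf "%d chunks\n", scalar @chunks;
```

Each chunk still contains single CRLF pairs between its own lines; only the double CRLF acts as a separator.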

The outer <> gets the header ($_), and I always have to check for a body even if I want to skip that entire record. That merely keeps the reading synchronized (and your other requirements forebode some other ways to get out of sync). The trick is that I have to look at the Content-length header to see if there's a body. I don't particularly trust that length value though because I haven't done the work to see if it's from LF output or CRLF output (that is, the logger added octets without changing the Content-Length header).

There are many ways to do this, but this is simple enough: check if the content length is greater than zero:

while( <> ) {
    my $body = '';
    $body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );

    # ... filter goes here ...

    print $_, $body;
    }

Now I need to decide if I want this record. You have two requirements, I think:

  • Call-ID has the $ID value
  • The body has rtpmap

Start with the $ID value. I want that to be the header value, so I want that in the pattern. For that, I use quotemeta to prepare the string to be interpolated into the pattern (the . is a special char). You have a line /Call-ID:\s+(.+)/ and $ID where I think you thought the capture value from (.+) would be compared to $ID, but that's not how it works.

my $ID = quotemeta('[email protected]');
while( <> ) {
    my $body = '';
    $body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
    next unless /\vCall-ID:\s+$ID/;
    print $_, $body;
    }

Here's an interesting note. I can't use the ^ beginning of line anchor because my line ending is CRLFCRLF, but the internal lines are separated by CRLF. I do know that the header will have vertical whitespace before it, so I add a \v to anchor it. Not a big deal.
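The anchoring behaviour can be checked on a made-up header chunk. This sketch compares a plain ^, the \v anchor used here, and (as an alternative not used in this answer) ^ with the /m modifier:

```perl
use strict;
use warnings;

# Made-up header chunk with CRLF between its lines.
my $header = "SIP/2.0 200 OK\r\nCall-ID: ABC123\r\nCSeq: 102 INVITE";

my $caret   = $header =~ /^Call-ID:/  ? 1 : 0;   # ^ only matches at string start
my $caret_m = $header =~ /^Call-ID:/m ? 1 : 0;   # /m lets ^ match after a newline
my $vert    = $header =~ /\vCall-ID:/ ? 1 : 0;   # \v matches the LF before the field

print "caret=$caret caret_m=$caret_m vert=$vert\n";
```

This prints `caret=0 caret_m=1 vert=1`: the plain ^ fails because Call-ID is not at the start of the chunk, while \v matches the LF that ends the previous line.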

Now, my pattern isn't going to change, so I can pre-compile that with qr//. Later I can use that right in the m//:

my $ID = quotemeta('[email protected]');
my $header_pattern = qr/\vCall-ID:\s+$ID/;

while( <> ) {
    my $body = '';
    $body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
    next unless /$header_pattern/;
    next unless $body =~ /\brtpmap\b/;

    print $_, $body;
    }
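A short sketch of why quotemeta matters before the value goes into qr//. The ID and the two header strings are invented for the example; an unescaped . in the ID would also match the second header, where letters replace the dots:

```perl
use strict;
use warnings;

# Hypothetical ID, made up for this example.
my $raw    = 'ABC123@10.0.0.1';
my $quoted = quotemeta $raw;                 # backslash-escapes the . and @
my $re     = qr/\vCall-ID:\s+($quoted)/;     # compiled once, reused below

my $good = "OK\r\nCall-ID: ABC123\@10.0.0.1\r\n";
my $bad  = "OK\r\nCall-ID: ABC123\@10x0y0z1\r\n";   # dots replaced by letters

print "good matches\n"       if $good =~ $re;
print "bad does not match\n" if $bad !~ $re;
```

The capture group is optional here; it is included because the later examples read the matched ID from $1.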

I might be content to leave it like this. The output will still have CRLF feeds, but who cares? I can fix that in another program such as dos2unix if it really matters.

In your example, I don't think the line endings matter. It looks like you want to extract a couple of values. I think that you want to print the value of the Call-ID line and the rtpmap value. Your code doesn't quite work for that for various reasons as you are trying to remember where you are and you stomp all over your state. Instead, now that I have the header and body, I use captures to get the values, then I output them joined with newlines. I never converted the line endings because I never needed to.

while( <> ) {
    my $body = '';
    $body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
    # $header_pattern must capture the ID here: qr/\vCall-ID:\s+($ID)/
    next unless m/$header_pattern/;
    my $this_id = $1;
    next unless $body =~ /\b(rtpmap:[^\v]+)/g;

    print join "\n", $this_id, $1;
    }

But there's another problem. The body has multiple rtpmap lines. If I want all of them, I need to make an adjustment. I can match the body in a global match and check how many results I get:

    my @rtpmaps = $body =~ /\b(rtpmap:[^\v]+)/g;
    next unless @rtpmaps > 0;
    print join "\n", $this_id, @rtpmaps;
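To check the list-context match in isolation, this sketch runs it on a body shaped like the first SDP block in the sample log (rebuilt by hand with CRLF line endings):

```perl
use strict;
use warnings;

# Body shaped like the sample log's SDP section, with CRLF line endings.
my $body = join "\r\n",
    'm=audio 16844 RTP/AVP 8 101',
    'a=rtpmap:8 PCMA/8000',
    'a=ptime:20',
    'a=rtpmap:101 telephone-event/8000',
    'a=fmtp:101 0-15';

# In list context, /g returns every capture, one per match.
my @rtpmaps = $body =~ /\b(rtpmap:[^\v]+)/g;

print scalar(@rtpmaps), " rtpmap lines\n";
print "$_\n" for @rtpmaps;
```

Because [^\v] excludes both CR and LF, each captured value stops before the line ending, so no stray carriage return ends up in the output.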

Here it is all together:

#!perl
use v5.10;
use warnings;

my $ID = quotemeta('[email protected]');
my $header_pattern = qr/\vCall-ID:\s+($ID)/;

my $old = select(ARGV);
$/ = "\x0D\x0A" x 2;
select($old);

chdir '/Users/brian/Desktop';
@ARGV = 'test.log';

while( <> ) {
    my $body = '';
    $body = <> if( /\vContent-Length:\s+([0-9]+)/i and $1 > 0 );
    next unless m/$header_pattern/;
    next unless length $body;
    my $this_id = $1;
    my @rtpmaps = $body =~ /\b(rtpmap:[^\v]+)/g;
    next unless @rtpmaps > 0;

    print join "\n", $this_id, @rtpmaps;
    }

You had an additional requirement with the lines beginning with CLG and TRC. You can inspect the line before you look for a body and decide what you'd like to do with those.

CodePudding user response:

I assume you are running on a system that uses Unix-style line endings, otherwise the file's Windows line endings would not be a problem. The key to handling Windows files under Unix is to let Perl do the dirty work by using the :crlf I/O layer when you open the file. To do this, you need to use the three-argument version of open(). In your case this is open LOG, '<:crlf', 'file.log' or die $!. Note that I do not need the parentheses in the open() because I have used the loosely-binding or rather than the tightly-binding ||.
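As a quick check of what the :crlf layer does, the following sketch writes a small made-up file with CRLF line endings to a temporary location (via File::Temp, chosen only for the demonstration) and reads it back in paragraph mode:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small made-up file with Windows (CRLF) line endings.
my ( $out, $name ) = tempfile( UNLINK => 1 );
binmode $out;                                  # write the bytes untouched
print {$out} "first\r\nsecond\r\n\r\nthird\r\n";
close $out or die "Cannot close $name: $!";

# The :crlf layer turns each CRLF into a bare LF while reading.
open my $in, '<:crlf', $name or die "Cannot open $name: $!";

my @paragraphs;
{
    local $/ = '';                             # paragraph mode
    @paragraphs = <$in>;
}
close $in;

printf "%d paragraphs\n", scalar @paragraphs;
print "no CR left\n" unless grep { /\r/ } @paragraphs;
```

Once the carriage returns are gone, the blank lines are truly empty, which is what lets paragraph mode split the input.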

The following is how I would implement your code, assuming I understand your requirements:

#!/usr/bin/env perl

use 5.010;        # for \K

use strict;
use warnings;

open my $log, '<:crlf', 'file.log'
    or die "Failed to open file.log: $!\n";

local $/ = '';
my $state = \&state_1;
while ( <$log> ) {
    if ( eof $log ) {
        s/ ^ CLG .*? \n \K (?= TRC ) /\n/smx;
    }
    $state = $state->();
}

sub state_1 {
    if ( m/ Call-ID: \s+ /smx ) {
        print;
        return \&state_2;
    }
    return \&state_1;
}

sub state_2 {
    if ( m/ \b rtpmap \b /smx ) {
        print;
    }
    return \&state_1;
}

# ex: set ts=8 sts=4 sw=4 tw=72 ft=perl expandtab shiftround :

Rather than do logic on flag variables (your $SDP) I just implemented a state machine.

My logic does not mention $ID because the value you give is always true. If $ID is false I believe no output at all should be produced.

Strictly speaking, $/ should be localized to prevent Spooky Action at a Distance, but in a small script like this it is not likely to cause problems.

The if ( eof $log ) ... implements your requirement that a blank line be inserted between two lines in the last paragraph. If your intent was to break this into two paragraphs you will need a different implementation.
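The substitution can be tried on its own. This sketch applies it to a shortened copy of the last two sample lines, assuming the CRLF pairs have already been reduced to LF by the :crlf layer:

```perl
use 5.010;    # \K needs Perl 5.10 or later
use strict;
use warnings;

# Last paragraph of the sample log, shortened, with LF line endings.
my $para = "CLG(2022-11-07 00:09:06.444)| Call(Terminate) |\n"
         . "TRC(2022-11-15 00:00:38.012)| SIP( OUT : Response ) |\n";

# \K keeps the CLG line; the empty spot before TRC is replaced by "\n".
$para =~ s/ ^ CLG .*? \n \K (?= TRC ) /\n/smx;

print $para;
```

After the substitution, the CLG and TRC lines are separated by one empty line, and nothing else in the paragraph changes.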
