plgrep

version 0.1.4, cbm 6/10/09

An implementation of grep in perl, with many powerful new features.

The man page is included below.


NAME

plgrep - grep enhanced with perl

SYNOPSIS

plgrep [options] pattern [file(s)]

DESCRIPTION

plgrep is a grep program with several enhancements. Some of these are taken from the GNU grep(1), some from the lesser-known rgrep(1), and some are entirely new.

plgrep supports the standard UNIX flavors of regular expressions (grep, fgrep, and egrep), but since it is written in perl(1), it can also use perl-style regular expressions as described in perlre(1), which are even more powerful than those of egrep. The default behavior, however, is still that of plain grep(1).

The flags for standard UNIX grep(1) (-chilnsvy) work as expected, though the -c behavior can be modified by other new options.

-c

print a count (per file) of the number of lines matching the pattern, instead of printing the lines themselves. (But also see -L, -p, -q, and -Q for possible modifications.)

-h

suppress printing of filenames when searching multiple files.

-i, -y

ignore upper/lower case.

-l

print only filenames containing matching lines (files with multiple matches only get their names printed once).

-n

print line numbers in front of matching lines.

-s

suppress warnings about nonexistent/inaccessible files.

-v

reverse the sense of matching, i.e., print or count non-matching lines.

GNU Enhancements

-e pattern

an alternate way of specifying the regular expression; useful if the pattern starts with a dash. Also has special behavior with -f; see below.

-E

treat the regular expression as in egrep(1).

-f patfile

read pattern(s) to search for from patfile. If patfile is '-', pattern(s) are read from standard input. Multiple lines in patfile are concatenated with the alternation operator '|'. Thus, if patfile contains three lines specifying regular expressions A, B, and C, plgrep will search for '(A|B|C)'. This is useful for working around the problem of overflowing the shell’s command-line buffer when searching for a very large number of alternative patterns.

Also, this option can be combined specially with -e (not a GNU feature): if the -e expression contains %s, the '|'-concatenated regular expression obtained from reading patfile is inserted into the -e regular expression anywhere a '%s' appears. For example, specifying

plgrep -e '/%s:' -f patfile

where patfile contains the three regular expressions A, B, and C, results in a final search pattern of '/(A|B|C):'.

See the next section for additional extensions to the -f option.
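
The way -f builds its pattern can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not plgrep's own perl code; the function name build_pattern is hypothetical, and skipping blank patfile lines is an assumption.

```python
def build_pattern(patfile_lines, e_expr=None):
    """Combine patfile lines into one alternation, optionally via an -e template."""
    # Strip newlines; skipping blank lines is an assumption of this sketch.
    alternatives = [line.rstrip("\n") for line in patfile_lines]
    alternatives = [alt for alt in alternatives if alt]
    combined = "(" + "|".join(alternatives) + ")"
    if e_expr is not None and "%s" in e_expr:
        # Insert the alternation wherever %s appears in the -e expression.
        return e_expr.replace("%s", combined)
    # The man page does not say what happens when -e has no %s;
    # this sketch simply returns the alternation.
    return combined


print(build_pattern(["A\n", "B\n", "C\n"]))          # (A|B|C)
print(build_pattern(["A\n", "B\n", "C\n"], "/%s:"))  # /(A|B|C):
```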

-F

treat the regular expression as in fgrep(1), i.e., no metacharacters.

-G

treat the regular expression as in grep(1) (the default).

-w

match the pattern only on word boundaries.

-x

match the pattern only if the whole line matches, i.e., match ’^pattern$’.

New Options

-a, -aa

read filenames from standard input instead of the command line. Under -a, any whitespace delimits a filename (useful for e.g. ’ls | plgrep -a pat’); under -aa, each line is taken as a single filename. This is useful in a pipeline, or for huge numbers of filenames.
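
The difference between -a and -aa matters mostly for filenames that contain spaces. A rough Python sketch of the two splitting rules follows; the function name is hypothetical, and ignoring empty lines under -aa is an assumption.

```python
def filenames_from_stdin(text, one_per_line=False):
    """Split text read from standard input into filenames."""
    if one_per_line:
        # -aa: each line is one filename, so embedded spaces survive.
        return [line for line in text.split("\n") if line]
    # -a: any run of whitespace separates filenames.
    return text.split()


listing = "notes.txt\nmy report.txt\n"
print(filenames_from_stdin(listing))                     # ['notes.txt', 'my', 'report.txt']
print(filenames_from_stdin(listing, one_per_line=True))  # ['notes.txt', 'my report.txt']
```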

-A N[,N|-N...][delim]

split each line on delimiter delim and grep only in the Nth field(s). Fields are numbered starting from 1 on the left. Field indexes can be given as N,N,N and/or N-N to specify a range. delim may be a single character or a perl(1) regular expression (in single quotes). If delim is not given, splits on white space like awk(1).

-AA N[,N|-N...][delim]

same as -A, except that leading empty fields are not counted; or in other words, leading instances of the delimiter are stripped before counting fields.
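
As an illustration of how -A and -AA differ, the following Python sketch selects fields from a line. It is an approximation under stated assumptions, not plgrep's implementation: the delimiter is passed as a separate argument rather than parsed out of the option string, out-of-range field numbers are silently skipped, and the names parse_fields and select_fields are hypothetical.

```python
import re


def parse_fields(spec):
    """Expand a field spec such as '1,3-5' into [1, 3, 4, 5]."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            fields.extend(range(int(low), int(high) + 1))
        else:
            fields.append(int(part))
    return fields


def select_fields(line, spec, delim=None, strip_leading=False):
    """Return the chosen fields of line; fields are numbered from 1."""
    if delim is None:
        parts = line.split()  # awk-like split on runs of white space
    else:
        if strip_leading:
            # -AA: drop leading delimiters before counting fields.
            line = re.sub(r"^(?:%s)+" % delim, "", line)
        parts = re.split(delim, line)
    return [parts[n - 1] for n in parse_fields(spec) if 1 <= n <= len(parts)]


print(select_fields(":a:b:c", "1", ":"))                      # ['']
print(select_fields(":a:b:c", "1", ":", strip_leading=True))  # ['a']
print(select_fields("x y  z", "1,3"))                         # ['x', 'z']
```

The pattern would then be matched only against the returned fields rather than the whole line.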

-b

print the matching part(s) of each line in bold. The escape sequences from tput(1) are used to embolden text.

-B

print the entire file, printing matches in bold as in -b; useful for seeing matches in context.

-C

ignore C/C++ comments when matching.

-d

debug: print the regular expression (on stderr) after it has been massaged into its perl form.

-D

debug: print each filename (on stderr) as it is processed.

-f :patfile

If the -f patfile argument is preceded by a ’:’ character, patfile denotes a "color pattern" file, which directs plgrep to colorize its output. A color pattern file has one regex per line preceded by a color specification:

colorspec whitespace regex

Regexps are treated in the usual fashion, but "colorspec" indicates how each is colored in the output. The colorspec consists of 0-2 digits optionally followed by "+". A single digit specifies the foreground text color; a second digit specifies the background color; and "+" indicates that the text should be bold (or on terminals that support it, brightened). If colorspec is omitted, the matched regex will be printed but not colorized. Colors are numbered by the terminal’s color palette; typically colors 0-9 are black, red, green, yellow, blue, magenta, cyan, white, white, and "default", respectively. Note: you may need to set your terminal to something like "xterm", and manipulate your path for a modern version of tput(1) (e.g. from the ncurses package) to make colorization work properly. You can test different color outputs using the shell commands

[tput setaf #;] [tput setab #;] [tput bold;] echo string; tput sgr0

where the setaf and setab tput arguments are single digits representing the text and background colors, respectively.

-H

opposite of -h; force printing of filename even if only one file is given.

-I

set perl’s input record separator, which by default is "\n". This is useful for processing multi-line records.

-j string

only affects the printing behavior of -p; ignored if -p is not given. If an input line has multiple matches, the matching subparts are concatenated using ’join($string,...)’ and output as a single line. For example,

plgrep -p -j , '[A-Z][a-z]+'

will print a comma-separated list of all capitalized words on each line.

-k N

under -o, tty interrupts will be sent to the child process rather than to plgrep itself. If N > 0, the child process will also be sent SIGALRM after N seconds.

-K

under -k, print a line to stderr indicating <INT> or <ALRM> when a child process is sent SIGINT or SIGALRM. The line is formatted as if it had been grepped from the file, e.g. using -t.

-L N

print names of files with exactly N matches (equivalent to -L N-N below).

-L [N1]-[N2]

like -L N, but print names of files with N1 <= N <= N2 matches. If given N1 > N2, plgrep swaps the values and treats them as -L N2-N1. Either or both of N1 and N2 may be zero. If omitted, N1 and N2 default to 1 and infinity, respectively, except in the case of -L-0 which is equivalent to -L0. -L1- is equivalent to -l. Note that -c does not cause -l or -L to be ignored, unlike UNIX grep. A file’s name and match-count are printed only if the count falls in the specified range. -L overrides -l.
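
The range rules for -L (defaults, swapping, and the -L-0 special case) are easy to misread, so here is a small Python sketch of them. It follows the description above only; the function names are hypothetical and this is not plgrep's own option parser.

```python
import math


def parse_count_range(spec):
    """Turn the argument of -L into an inclusive (low, high) pair."""
    if "-" not in spec:
        n = int(spec)
        return (n, n)              # -L N is the same as -L N-N
    low_text, high_text = spec.split("-", 1)
    if low_text == "" and high_text == "0":
        return (0, 0)              # documented special case: -L-0 means -L0
    low = int(low_text) if low_text else 1
    high = int(high_text) if high_text else math.inf
    if low > high:
        low, high = high, low      # reversed bounds are swapped
    return (low, high)


def file_is_listed(match_count, spec):
    """True if a file with match_count matches would be named under -L spec."""
    low, high = parse_count_range(spec)
    return low <= match_count <= high


print(parse_count_range("3"))    # (3, 3)
print(parse_count_range("1-"))   # (1, inf)
print(parse_count_range("5-2"))  # (2, 5)
print(parse_count_range("-0"))   # (0, 0)
```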

-m N[,N...]

ignored unless -p is in effect. Print (or count) only the Nth numbered match(es) on each line, if present. For example, if -p -m 2 are in effect, print only the second instance of a pattern match on any given line. If -p -c -m 2 are in effect, count only lines which have at least 2 instances of the match. If -p -l -m 2 are in effect, list only the files containing a line which has at least 2 instances of the match.

Multiple values of N can be specified, causing multiple different matches to be printed or counted per line, if they are present.

-M

instead of printing the text of a matching line, just print the number of matches on that line (when there is a match). Turns off -p.

-N

like -n, but print only the line number, not the matching text. Turns off -M.

-o cmd

instead of opening files directly for searching, execute cmd via popen(2) for each filename argument. If cmd contains any instances of %s, they are each substituted with the current filename; otherwise, if %s does not occur in cmd, the filename is appended to the end of cmd. This option has a similar function to xargs(1) in that it executes a command for every argument, but plgrep makes it easy to identify which argument a given output line came from. Here are some example usages:

plgrep -o gzcat pattern *.gz

plgrep -o strings pattern /bin/*

plgrep -o 'ssh %s who' user host1 host2 ...

echo $path | plgrep -a -o ls command

df -lk | grep efs | cut -d' ' -f1 | plgrep -a -o 'quot -v' user
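
The %s substitution rule for -o can be sketched as follows. This Python version is only a model of the documented behavior: the function names are hypothetical, the "name:line" output shape is chosen for illustration, and the sketch does no shell quoting of filenames (how plgrep itself quotes them is not stated here).

```python
import subprocess


def build_command(cmd, filename):
    """Build the command line run for one filename argument under -o."""
    if "%s" in cmd:
        # Every %s is replaced by the current filename.
        return cmd.replace("%s", filename)
    # No %s: the filename is appended to the end of the command.
    return cmd + " " + filename


def tagged_output(cmd, filenames):
    """Run the command once per name and label each output line with that name."""
    lines = []
    for name in filenames:
        # Caution: the name is not shell-quoted in this sketch.
        result = subprocess.run(build_command(cmd, name), shell=True,
                                capture_output=True, text=True)
        lines.extend("%s:%s" % (name, line) for line in result.stdout.splitlines())
    return lines


print(build_command("gzcat", "log.gz"))       # gzcat log.gz
print(build_command("ssh %s who", "host1"))   # ssh host1 who
```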

-O N

instead of printing all matches, print only the Nth match.

-O [N1]-[N2]

like -O N, but print the range of matches from N1 to N2 inclusive. If omitted, N1 and N2 default to 1 and infinity, respectively.

-P

treat the pattern as a perl-5 regular expression (see perlre(1)).

-p

treat multiple matches on a single line separately. Such matches are maximal and must not overlap. Every non-overlapping match increments the match count, and the matching part(s) of the input line is/are printed as separate output lines (unless any of [-clL] are in effect, which suppress the output; or if -j is in effect, which prints all matching parts of the input line on a single output line using a join string [see -j]).

If -p and -P are given together, parentheses can be used to print only a subpart of the matching expression. For example,

plgrep -pP '^cbm:.*:(.*)' /etc/passwd

could be used to print cbm’s login shell, and

plgrep -pP '^cbm:.*?:(.*?):' /etc/passwd

would print out cbm’s UID (see perlre(1) for a discussion of the *? operator).
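
To see why these two patterns pick out different fields, the same expressions can be tried with Python's re module, whose greedy and non-greedy operators behave like perl's for these cases. The sketch below is an approximation of what -p prints (and of how -j joins the pieces); the function name and the sample passwd line are made up for illustration.

```python
import re


def matched_parts(pattern, line):
    """Yield what -p would print for each non-overlapping match on a line."""
    for m in re.finditer(pattern, line):
        # With a parenthesised group, keep only the group; otherwise the whole match.
        yield m.group(1) if m.re.groups else m.group(0)


# Hypothetical passwd entry, for illustration only.
line = "cbm:x:1001:100:Cliff:/home/cbm:/bin/tcsh"
print(list(matched_parts(r"^cbm:.*:(.*)", line)))     # ['/bin/tcsh']
print(list(matched_parts(r"^cbm:.*?:(.*?):", line)))  # ['1001']

# -j joins the parts found on one line with the given string.
print(",".join(matched_parts(r"[A-Z][a-z]+", "Alice met Bob in Paris")))  # Alice,Bob,Paris
```

The greedy .* in the first pattern runs to the last colon, leaving the final field for the group; the non-greedy .*? in the second stops at the first colon it can.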

-q, -Q N

unconditionally quits the current file after one (or N) matches are found. -q is equivalent to -Q 1. If given with -c, counts are truncated at 1 or N, respectively.

-QQ [N]

like -q/-Q, but prints every line up to the Nth match (N=1 default).

-r

highlight (reverse-print) the matching part(s) of each line on stdout. The escape sequences from tput(1) are used to highlight text.

-R

print the entire file, highlighting matches as in -r; useful for seeing matches in context.

-t format

instead of using the traditional grep output, format each output line (when a match occurs) using a printf-like control string. The following %-sequences are recognized:

%a    cumulative lengths of %g matches (this file)
%A    cumulative lengths of %g matches (all files)
%b    byte number (this file), at end of matching line
%B    byte number (all files)
%c    cumulative number of matches (this file)
%C    cumulative number of matches (all files)
%d    number of lines since last match
%D    number of bytes since last match
%f    filename
%g    text of matching line, or text of match under -p
%l    length of %g text
%m    number of matches on this line
%n    line number (this file)
%N    cumulative line number across all files
%t    time (unix epoch) line printed
%T    time (unix epoch) file opened
%x    text of matching line regardless of -p setting
%z    z-label (see -z option)
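
As a rough model of how a -t control string is expanded, the Python sketch below substitutes a handful of the codes listed above from a dictionary. It is not plgrep's formatter: the function name is hypothetical, unknown codes are simply left in place, and details such as whether %l counts the trailing newline are not covered.

```python
import re


def expand_format(fmt, values):
    """Replace each %X sequence in fmt with values[X]; unknown codes are left alone."""
    def substitute(m):
        code = m.group(1)
        return str(values[code]) if code in values else m.group(0)
    return re.sub(r"%(.)", substitute, fmt)


text = "hello world"
values = {"f": "notes.txt", "n": 12, "g": text, "l": len(text)}
print(expand_format("%f:%n: %g (%l chars)", values))
# notes.txt:12: hello world (11 chars)
```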

-T

pre-scan each file (using the perl -T operator) to grep on text-only files. Increases overhead, but filters out garbage from binary files.

-V regex

print all lines (inclusive) between a line matching the main pattern and the first subsequent line matching regex. For example, if a file contains the lines

apple
banana
cherry
date
eggfruit

the command
plgrep -V 'e$' banana file
would print

banana
cherry
date

Additionally, if the flags
-V regex -V 'jstring'
are given, the lines will be joined with string (which may be null) and printed on a single line. Thus, in the above example, the command
plgrep -V 'e$' -V 'j, ' banana file
would print
banana, cherry, date
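
The range logic of -V can be sketched in Python as follows. This reproduces the example above but is only an approximation of plgrep: the function names are hypothetical, and printing an unterminated range at end of file is an assumption.

```python
import re


def flush(block, joiner):
    """Emit a finished range: one joined line, or the lines unchanged."""
    return [joiner.join(block)] if joiner is not None else block


def v_ranges(lines, main_pat, end_pat, joiner=None):
    """Collect lines from each main-pattern match through the next end_pat match."""
    output, block = [], None
    for line in lines:
        if block is None:
            if re.search(main_pat, line):
                block = [line]  # a main-pattern match opens a range
        else:
            block.append(line)
            if re.search(end_pat, line):  # only subsequent lines can close it
                output.extend(flush(block, joiner))
                block = None
    if block is not None:
        # Range still open at end of input: printing it is an assumption.
        output.extend(flush(block, joiner))
    return output


fruit = ["apple", "banana", "cherry", "date", "eggfruit"]
print(v_ranges(fruit, "banana", "e$"))        # ['banana', 'cherry', 'date']
print(v_ranges(fruit, "banana", "e$", ", "))  # ['banana, cherry, date']
```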

-X

prints nothing on stdout; simply exits with the appropriate status (0 for at least one match, 1 for no matches, 2 for errors). Turns on -q and turns off all other printing and counting flags (-bBchHlLnNOpQrR).

-z regex

search for a separate labelling regex. If this regex matches, it is not printed or counted, but saved in a register whose contents can be printed using the %z code of the -t formatting option. The -z regex is interpreted according to the -FGEP flags.

Interaction of Flags

Some of plgrep’s new features are discussed here to explain the powerful but sometimes nonintuitive behavior when several flags from the set [-clLpqQv] are used together.

plgrep always counts the number of times the regular expression is matched as it processes each file. This 'match count' refers by default to the number of lines that contain at least one instance of the pattern. However, the flags -pqQv modify this behavior. The -p flag causes multiple matches on one line to be counted individually (such matches are maximal and nonoverlapping). Under -v, the matching sense is reversed -- lines not matching the pattern are counted (once). (Note that -p and -v do not make sense together, since -v treats the entire line as a single match.) The -q and -Q flags do not change the matching or counting behavior, but set an upper limit on the number of matches to be found (per file). If this limit is reached, scanning of the current file stops immediately and the match count for that file is truncated at the upper limit.

By default, plgrep runs in 'print' mode; that is, it prints lines that match the regular expression. However, any of the -clL flags put plgrep into 'list' mode -- instead of printing matched lines, it prints filenames on stdout. The -c flag by itself prints all filenames and their match counts (tabulated as described above). If -l or -L N appear, the names of only those files whose match count is at least 1 (under -l) or N (under -L) are printed. Also note that unlike standard grep, plgrep’s -c flag does not cause -l (or -L) to be ignored; this combination has the more desirable effect of printing only filenames selected by -l (or -L) with their match counts.

Keep in mind that each file is scanned in its entirety unless the first (or Nth) pattern match occurs and either (1) -q (or -Q) is given, or (2) -l (or -L) appears without -c.

Also note that -l or -L can be combined usefully with -q or -Q, but only in the presence of -c (the -l/-L flag establishes a minimum match count, and -q/-Q establishes a maximum).
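
The counting rules in this section can be summarized with a short Python model. It follows the documented intent only (the function name and parameters are hypothetical) and it truncates at the limit in all cases, so it does not reproduce the -p overshoot noted under BUGS.

```python
import re


def match_count(lines, pattern, per_match=False, invert=False, limit=None):
    """Count matches in one file the way the section above describes."""
    count = 0
    for line in lines:
        hits = len(re.findall(pattern, line))  # non-overlapping matches
        if invert:
            count += 1 if hits == 0 else 0     # -v: non-matching lines, once each
        elif per_match:
            count += hits                      # -p: every match counts
        else:
            count += 1 if hits else 0          # default: matching lines
        if limit is not None and count >= limit:
            return limit                       # -q/-Q: stop scanning, truncate
    return count


sample = ["a a", "b", "a"]
print(match_count(sample, "a"))                  # 2
print(match_count(sample, "a", per_match=True))  # 3
print(match_count(sample, "a", invert=True))     # 1
print(match_count(sample, "a", limit=1))         # 1
```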

SEE ALSO

grep(1), fgrep(1), egrep(1), rgrep(1), perl(1), perlre(1), tput(1).

AUTHOR

Cliff Miller (see http://www.nightcoder.com).

WISH LIST

Multiple patterns. It sure would be nice to be able to say something like

plgrep -li -e '^From:.*ralph' -e '^To:.*cbm'

and have plgrep list only the files that contain BOTH lines. Actually this could be partly implemented by using perl’s paragraph mode. But I’m sure there are other uses of having a match count array (one count for each pattern), then allowing things like -L n1,n2,n3 ... or -Q n1,n2,n3 ...

BUGS

Some of the more obscure treatments of regular expressions by the UNIX grep family may not be emulated correctly. One instance I know of is that many implementations of egrep(1) do not appear to treat \{, \}, or \[1-9] as advertised on their man pages. If you feel a given regular expression is not receiving proper treatment from plgrep, please read the extensive discussion found in the plgrep script itself. The treatment of metacharacters and backslashes is fully explained therein. You may also find a way to get the behavior you want (if you just want the damn thing to work, as is often the case).

The -p flag actually counts all its matches before the upper count limit of -q/-Q is enforced, so a combination of flags like -cpq or -cp -Q N may cause some counts larger than the limit to appear.

The -r flag cannot highlight grouped subexpressions in the same way that -p can print them; i.e.,

plgrep -Pp 'a(b)'

will print only the ’b’ part of lines where ’a’ precedes ’b’ (as expected), but

plgrep -Pr 'a(b)'

will highlight both ’a’ and ’b’. This is due to limitations in the perl regex grouping operator ’()’.