MAWK(1)



NAME
       mawk - pattern scanning and text processing language

SYNOPSIS
       mawk  [-W  option] [-F value] [-v var=value] [--] 'program
       text' [file ...]
       mawk [-W option] [-F value] [-v  var=value]  [-f  program-
       file] [--] [file ...]

DESCRIPTION
       mawk  is  an interpreter for the AWK Programming Language.
       The AWK language is useful for manipulation of data files,
       text  retrieval  and  processing,  and for prototyping and
       experimenting with algorithms.  mawk is a new awk, meaning
       it   implements  the  AWK  language  as  defined  in  Aho,
       Kernighan and Weinberger, The  AWK  Programming  Language,
       Addison-Wesley  Publishing,  1988.  (Hereafter referred to
       as the AWK book.)   mawk  conforms  to  the  Posix  1003.2
       (draft 11.3) definition of the AWK language which contains
       a few features not described in the AWK  book,   and  mawk
       provides a small number of extensions.

       An AWK program is a sequence of pattern {action} pairs and
       function definitions.  Short programs are entered  on  the
       command line usually enclosed in ' ' to avoid shell inter-
       pretation.  Longer programs can be read  in  from  a  file
       with  the -f option.  Data  input is read from the list of
       files on the command line or from standard input when  the
       list is empty.  The input is broken into records as deter-
       mined by the record separator variable, RS.  Initially, RS
       = "\n" and records are synonymous with lines.  Each record
       is compared against each pattern and if  it  matches,  the
       program text for {action} is executed.

OPTIONS
       -F value       sets the field separator, FS, to value.

       -f file        Program  text  is read from file instead of
                      from the command line.  Multiple -f options
                      are allowed.

       -v var=value   assigns value to program variable var.

       --             indicates the unambiguous end of options.

       The above options will be available with any Posix compat-
       ible implementation of AWK,  and  implementation  specific
       options are prefaced with -W.  mawk provides six:

       -W version     mawk  writes  its  version and copyright to
                      stdout and compiled limits  to  stderr  and
                      exits 0.

       -W dump        writes  an  assembler  like  listing of the
                      internal representation of the  program  to
                      stdout  and exits 0 (on successful compila-
                      tion).

       -W interactive sets unbuffered writes to stdout  and  line
                      buffered  reads  from  stdin.  Records from
                      stdin are lines regardless of the value  of
                      RS.

       -W exec file   Program  text is read from file and this is
                      the last option.  Useful  on  systems  that
                      support  the  #!  "magic number" convention
                      for executable scripts.

       -W sprintf=num adjusts the size of mawk's internal sprintf
                      buffer to num bytes.  More than rare use of
                      this option indicates mawk should be recom-
                      piled.

       -W posix_space forces  mawk  not  to  consider  '\n' to be
                      space.

       The short forms -W[vdiesp] are recognized and on some sys-
       tems -We is mandatory to avoid command line length limita-
       tions.
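
       As a small sketch of the portable options, the command
       below sets FS with -F and a program variable with -v.  The
       input line and the variable name label are made up for the
       example, and the generic awk command is used; mawk can be
       substituted.

```shell
# -F sets the field separator FS; -v assigns a program variable.
printf 'root:x:0:0\n' | awk -F: -v label=user '{ print label, $1, $3 }'
# prints: user root 0
```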

THE AWK LANGUAGE
   1. Program structure
       An AWK program is a sequence of pattern {action} pairs and
       user function definitions.

       A pattern can be:
              BEGIN
              END
              expression
              expression , expression

       One,  but  not  both,  of pattern {action} can be omitted.
       If {action} is omitted it is implicitly  {  print  }.   If
       pattern  is omitted, then it is implicitly matched.  BEGIN
       and END patterns require an action.
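
       The rules above can be tried with a few one-line programs.
       This is only a sketch: the sample data (a name and a count
       per line) is hypothetical, and the generic awk command
       stands in for mawk.

```shell
# Hypothetical sample input: a name and a count on each line.
data='apple 3
banana 12
cherry 7'

# Pattern only: the action defaults to { print }.
printf '%s\n' "$data" | awk '$2 > 5'

# Action only: the pattern is implicitly matched for every record.
printf '%s\n' "$data" | awk '{ print $1 }'

# BEGIN and END always carry an explicit action.
printf '%s\n' "$data" | awk 'BEGIN { print "start" } END { print NR, "records" }'
```

       The first command prints the banana and cherry lines, the
       second prints the three names, and the third prints start
       followed by 3 records.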

       Statements are  terminated  by  newlines,  semi-colons  or
       both.  Groups of statements such as actions or loop bodies
       are blocked via { ... } as in C.  The last statement in  a
       block  doesn't  need  a  terminator.   Blank lines have no
       meaning; an empty statement is  terminated  with  a  semi-
       colon.  Long statements can be continued with a backslash,
       \.  A statement can be broken without a backslash after  a
       comma, left brace, &&, ||, do, else, the right parenthesis
       of an if, while or for statement, and the right parenthesis
       of a function definition.  A comment starts with # and
       extends to, but does not include, the end of line.

       The  following  statements  control  program  flow  inside
       blocks.
              if ( expr ) statement
              if ( expr ) statement else statement
              while ( expr ) statement
              do statement while ( expr )
              for ( opt_expr ; opt_expr ; opt_expr ) statement
              for ( var in array ) statement
              continue
              break

   2. Data types, conversion and comparison
       There  are  two  basic  data  types,  numeric  and string.
       Numeric constants can be integer  like  -2,  decimal  like
       1.08,  or  in  scientific  notation like -1.1e4 or .28E-3.
       All numbers are represented internally  and  all  computa-
       tions are done in floating point arithmetic.  So for exam-
       ple, the expression 0.2e2 == 20 is true and true is repre-
       sented as 1.0.

       String constants are enclosed in double quotes.

            "This is a string with a newline at the end.\n"

       Strings can be continued across a line by escaping (\) the
       newline.  The following escape sequences are recognized.

            \\        \
            \"        "
            \a        alert, ascii 7
            \b        backspace, ascii 8
            \t        tab, ascii 9
            \n        newline, ascii 10
            \v        vertical tab, ascii 11
            \f        formfeed, ascii 12
            \r        carriage return, ascii 13
            \ddd      1, 2 or 3 octal digits for ascii ddd
            \xhh      1 or 2 hex digits for ascii  hh

       If you escape any other character \c, you  get  \c,  i.e.,
       mawk ignores the escape.

       There are really three basic data types; the third is number
       and string which has both a numeric value and a string value
       at the same time.  User defined variables come into existence
       when first referenced and are initialized to null, a number
       and string value which has numeric value 0 and string value
       "".  Non-trivial number and string typed data come from input
       and are typically stored in fields.  (See section 4).

       The type of an expression is determined by its context and
       automatic  type conversion occurs if needed.  For example,
       consider the statements

            y = x + 2  ;  z = x  "hello"

       The value stored in variable y will be typed numeric.   If
       x  is  not  numeric, the value read from x is converted to
       numeric before it is added to 2  and  stored  in  y.   The
       value  stored  in variable z will be typed string, and the
       value of x will be converted to string  if  necessary  and
       concatenated with "hello".  (Of course, the value and type
       stored in x is not changed by any conversions.)  A  string
       expression  is  converted  to  numeric  using  its longest
       numeric prefix as with atof(3).  A numeric expression is
       converted to string by replacing expr with
       sprintf(CONVFMT, expr), unless expr can be represented on the
       host machine as an exact integer, in which case it is
       converted to sprintf("%d", expr).  Sprintf() is an AWK
       built-in that duplicates the functionality of sprintf(3), and
       CONVFMT is a built-in variable used for internal conversion
       from number to string and initialized to "%.6g".  Explicit
       type conversions can be forced: expr "" is string and expr+0
       is numeric.
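
       The conversion rules can be illustrated with a short BEGIN
       program.  This is a sketch with made-up variable names and
       values, run with the generic awk command.

```shell
awk 'BEGIN {
    x = "3.5kg"             # a string whose longest numeric prefix is 3.5
    y = x + 2               # numeric context: x is read as 3.5
    z = x "hello"           # string context: x is used unchanged
    print y                 # 5.5
    print z                 # 3.5kghello

    s = (10 / 4) ""         # not an integer: converted with CONVFMT
    t = (6 / 2) ""          # exact integer: converted as with "%d"
    CONVFMT = "%.2f"
    u = (10 / 3) ""         # converted with the new CONVFMT
    print s, t, u           # 2.5 3 3.33
}'
```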

       To  evaluate,  expr1  rel-op  expr2,  if both operands are
       numeric or  number  and  string  then  the  comparison  is
       numeric;  if  both  operands  are string the comparison is
       string; if one operand is string, the  non-string  operand
       is  converted and the comparison is string.  The result is
       numeric, 1 or 0.

       In boolean contexts such as, if  (  expr  )  statement,  a
       string  expression evaluates true if and only if it is not
       the empty string ""; numeric values if  and  only  if  not
       numerically zero.
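
       A brief sketch of the comparison and boolean rules, using
       constants only (a numeric constant is numeric, a quoted
       constant is string).  The generic awk command is assumed.

```shell
awk 'BEGIN {
    print (10 < 9)          # both numeric: numeric comparison, 0
    print ("10" < "9")      # both string: string comparison, 1
    print (10 < "9")        # one string operand: 10 becomes "10", 1
    if ("0")  print "non-empty string is true"
    if (!"")  print "empty string is false"
    if (!0)   print "numeric zero is false"
}'
```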

   3. Regular expressions
       In the AWK language, records, fields and strings are often
       tested for matching a regular expression.  Regular expres-
       sions are enclosed in slashes, and

            expr ~ /r/

       is an AWK expression that evaluates to 1 if expr "matches"
       r, which means a substring  of  expr  is  in  the  set  of
       strings defined by r.  With no match the expression evalu-
       ates to 0; replacing ~ with the "not match" operator, !~ ,
       reverses the meaning.  As  pattern-action pairs,

            /r/ { action }   and   $0 ~ /r/ { action }

       are  the  same,  and for each input record that matches r,
       action is executed.  In fact, /r/  is  an  AWK  expression
       that  is  equivalent to ($0 ~ /r/) anywhere except when on
       the right side of a match operator or passed as  an  argu-
       ment to a built-in function that expects a regular expres-
       sion argument.

       AWK uses extended regular expressions  as  with  egrep(1).
       The  regular  expression  metacharacters, i.e., those with
       special meaning in regular expressions are

             ^ $ . [ ] | ( ) * + ?

       Regular expressions are built up from characters  as  fol-
       lows:
              c            matches any non-metacharacter c.
              \c           matches  a  character  defined  by the
                           same escape sequences used  in  string
                           constants  or  the literal character c
                           if \c is not an escape sequence.
              .            matches any character (including  new-
                           line).
              ^            matches the front of a string.
              $            matches the back of a string.
              [c1c2c3...]  matches  any  character  in  the class
                           c1c2c3... .  An interval of characters
                           is denoted c1-c2 inside a class [...].
              [^c1c2c3...] matches any character not in the class
                           c1c2c3...

       Regular  expressions  are  built  up  from  other  regular
       expressions as follows:
              r1r2         matches r1 followed immediately by  r2
                           (concatenation).
              r1 | r2      matches r1 or r2 (alternation).
              r*           matches r repeated zero or more times.
              r+           matches r repeated one or more  times.
              r?           matches r zero or once.
              (r)          matches r, providing grouping.

       The  increasing  precedence  of  operators is alternation,
       concatenation and unary (*, + or ?).

       For example,

            /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
            /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/

       are matched by AWK identifiers and AWK  numeric  constants
       respectively.   Note that . has to be escaped to be recog-
       nized as a decimal point, and that metacharacters are  not
       special inside character classes.

       Any expression can be used on the right hand side of the ~
       or !~ operators or passed to a  built-in  that  expects  a
       regular expression.  If needed, it is converted to string,
       and then interpreted as a regular expression.   For  exam-
       ple,

            BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }

            $0 ~ "^" identifier

       prints all lines that start with an AWK identifier.
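
       The same idea works with both match operators.  In this
       sketch the regular expression is held in a variable and
       anchored at both ends; the three input lines are made up,
       and the generic awk command is assumed.

```shell
printf 'foo1\n2bar\n_x\n' | awk '
    BEGIN { identifier = "^[_a-zA-Z][_a-zA-Z0-9]*$" }
    $0 ~ identifier  { print $0 ": identifier" }
    $0 !~ identifier { print $0 ": no match" }
'
# foo1: identifier
# 2bar: no match
# _x: identifier
```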

       mawk  recognizes  the  empty regular expression, //, which
       matches the empty string  and  hence  is  matched  by  any
       string  at  the  front,  back and between every character.
       For example,

            echo  abc | mawk '{ gsub(//, "X") ; print }'
            XaXbXcX

   4. Records and fields
       Records are read in one at a time, and stored in the field
       variable  $0.   The  record is split into fields which are
       stored in $1, $2, ..., $NF.  The built-in variable  NF  is
       set  to  the  number  of fields, and NR and FNR are incre-
       mented by 1.  Fields above $NF are set to "".

       Assignment to $0 causes the fields and  NF  to  be  recom-
       puted.   Assignment  to  NF  or to a field causes $0 to be
       reconstructed by concatenating the $i's separated by  OFS.
       Assignment   to  a  field  with  index  greater  than  NF,
       increases NF and causes $0 to be reconstructed.
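
       A short sketch of how assignments to fields and to $0
       affect each other, with a made-up input line and the
       generic awk command:

```shell
# Assigning to a field rebuilds $0 with OFS between the fields.
echo 'a b c' | awk '{ $2 = "X"; print; print NF }'
# a X c
# 3

# Assigning past NF extends the record; the new field 4 is empty.
echo 'a b c' | awk 'BEGIN { OFS = "-" } { $5 = "e"; print; print NF }'
# a-b-c--e
# 5

# Assigning to $0 recomputes the fields and NF.
echo 'a b c' | awk '{ $0 = "x y"; print $2, NF }'
# y 2
```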

       Data input stored in fields is string, unless  the  entire
       field  has  numeric  form  and then the type is number and
       string.  For example,

            echo 24 24E |
            mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
            0 1 1 1

       $0 and $2 are string and $1 is  number  and  string.   The
       first  comparison  is  numeric,  the second is string, the
       third is string (100 is converted to "100"), and the  last
       is string.

   5. Expressions and operators
       The  expression  syntax  is similar to C.  Primary expres-
       sions are numeric constants, string constants,  variables,
       fields,  arrays  and function calls.  The identifier for a
       variable, array or function can be a sequence of  letters,
       digits  and underscores, that does not start with a digit.
       Variables are not declared; they exist when  first  refer-
       enced and are initialized to null.

       New  expressions are composed with the following operators
       in order of increasing precedence.
              assignment          =  +=  -=  *=  /=  %=  ^=
              conditional         ?  :
              logical or          ||
              logical and         &&
              array membership    in
              matching            ~   !~
              relational          <  >   <=  >=  ==  !=
              concatenation       (no explicit operator)
              add ops             +  -
              mul ops             *  /  %
              unary               +  -
              logical not         !
              exponentiation      ^
              inc and dec         ++ -- (both post and pre)
              field               $

       Assignment, conditional and exponentiation associate right
       to left; the other operators associate left to right.  Any
       expression can be parenthesized.
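
       A few expressions whose results depend on the precedence
       and associativity rules above; a sketch using the generic
       awk command:

```shell
awk 'BEGIN {
    print 2 ^ 3 ^ 2         # right associative: 2 ^ (3 ^ 2) = 512
    print -2 ^ 2            # ^ binds tighter than unary minus: -4
    print 1 " " 2 + 3       # + binds tighter than concatenation: 1 5
    print 2 + 3 * 4 % 5     # mul ops first, left to right: 2 + 2 = 4
}'
```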

   6. Arrays
       Awk provides one-dimensional arrays.  Array  elements  are
       expressed as array[expr].  Expr is internally converted to
       string type, so, for example, A[1] and A["1"] are the same
       element  and  the  actual index is "1".  Arrays indexed by
       strings are called associative arrays.  Initially an array
       is  empty; elements exist when first accessed.  An expres-
       sion, expr in array evaluates to 1 if array[expr]  exists,
       else to 0.

       There  is a form of the for statement that loops over each
       index of an array.

            for ( var in array ) statement

       sets var to each index of array  and  executes  statement.
       The order that var traverses the indices of array is not
       defined.

       The statement, delete array[expr], causes array[expr]  not
       to exist.  mawk supports an extension, delete array, which
       deletes all elements of array.

       Multidimensional arrays are synthesized with concatenation
       using the built-in variable SUBSEP.  array[expr1,expr2] is
       equivalent to array[expr1 SUBSEP expr2].   Testing  for  a
       multidimensional  element uses a parenthesized index, such
       as

            if ( (i, j) in A )  print A[i, j]
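
       The array operations of this section, put together in one
       sketch.  The array name A and its contents are arbitrary,
       and the generic awk command is assumed.

```shell
awk 'BEGIN {
    A[1] = "one"
    print (("1" in A) ? "same element" : "different")
    print (2 in A)              # 0: the in test does not create A[2]
    A[2, 3] = "x"               # stored under the index 2 SUBSEP 3
    if ((2, 3) in A) print "found"
    delete A[1]
    print (1 in A)              # 0 after the delete
    n = 0
    for (k in A) n++            # only the (2, 3) element remains
    print n                     # 1
}'
```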

   7. Built-in variables
       The  following  variables  are  built-in  and  initialized
       before program execution.
              ARGC      number of command line arguments.
              ARGV      array    of   command   line   arguments,
                        0..ARGC-1.
              CONVFMT   format for internal conversion of numbers
                        to string, initially = "%.6g".
              ENVIRON   array  indexed  by environment variables.
                        An  environment  string,   var=value   is
                        stored as ENVIRON[var] = value.
              FILENAME  name of the current input file.
              FNR       current record number in FILENAME.
              FS        splits  records  into fields as a regular
                        expression.
              NF        number of fields in the current record.
              NR        current record number in the total  input
                        stream.
              OFMT      format  for printing numbers; initially =
                        "%.6g".
              OFS       inserted between fields on  output,  ini-
                        tially = " ".
              ORS       terminates  each  record  on output, ini-
                        tially = "\n".
              RLENGTH   length set by the last call to the built-
                        in function, match().
              RS        input record separator, initially = "\n".
              RSTART    index set by the last call to match().
              SUBSEP    used to build multiple array  subscripts,
                        initially = "\034".

   8. Built-in functions
       String functions
              gsub(r,s,t)  gsub(r,s)
                     Global  substitution, every match of regular
                     expression r in variable t  is  replaced  by
                     string  s.   The  number  of replacements is
                     returned.  If t is omitted, $0 is used.   An
                     & in the replacement string s is replaced by
                     the matched substring of t.  \& and  \\  put
                     literal   &  and  \,  respectively,  in  the
                     replacement string.
              index(s,t)
                     If t is a substring of s, then the  position
                     where  t  starts  is  returned,  else  0  is
                     returned.  The first character of  s  is  in
                     position 1.
              length(s)
                     Returns the length of string s.
              match(s,r)
                     Returns the index of the first longest match
                     of  regular  expression  r  in   string   s.
                     Returns  0  if  no match.  As a side effect,
                     RSTART is set to the return value.   RLENGTH
                     is  set  to the length of the match or -1 if
                     no match.  If the empty string  is  matched,
                     RLENGTH  is  set  to 0, and 1 is returned if
                     the match is at the front,  and  length(s)+1
                     is returned if the match is at the back.
              split(s,A,r)  split(s,A)
                     String  s  is  split  into fields by regular
                     expression r and the fields are loaded  into
                     array  A.  The number of fields is returned.
                     See section 11 below for more detail.  If  r
                     is omitted, FS is used.
              sprintf(format,expr-list)
                     Returns  a string constructed from expr-list
                     according to format.  See the description of
                     printf() below.
              sub(r,s,t)  sub(r,s)
                     Single  substitution,  same as gsub() except
                     at most one substitution.
              substr(s,i,n)  substr(s,i)
                     Returns the substring of string s,  starting
                     at  index  i, of length n.  If n is omitted,
                     the suffix of s, starting at i is  returned.
              tolower(s)
                     Returns  a  copy  of  s  with all upper case
                     characters converted to lower case.
              toupper(s)
                     Returns a copy of  s  with  all  lower  case
                     characters converted to upper case.

       Arithmetic functions
              atan2(y,x)     Arctan of y/x between -pi and pi.
              cos(x)         Cosine function, x in radians.
              exp(x)         Exponential function.
              int(x)         Returns x truncated towards zero.
              log(x)         Natural logarithm.
              rand()         Returns a random number between zero and one.
              sin(x)         Sine function, x in radians.
              sqrt(x)        Returns square root of x.
              srand(expr)  srand()
                     Seeds the random number generator, using the
                     clock if expr is omitted,  and  returns  the
                     value  of the previous seed.  mawk seeds the
                     random number generator from  the  clock  at
                     startup  so  there  is  no real need to call
                     srand().  Srand(expr) is useful for  repeat-
                     ing pseudo random sequences.
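
       Several of the built-in functions in use, as a sketch with
       an arbitrary test string and the generic awk command:

```shell
awk 'BEGIN {
    s = "hello world"
    print index(s, "world")             # 7
    print substr(s, 1, 5)               # hello
    print substr(s, 7)                  # world
    pos = match(s, /o+/)
    print pos, RSTART, RLENGTH          # 5 5 1
    n = gsub(/o/, "[&]", s)             # & stands for the matched text
    print n, s                          # 2 hell[o] w[o]rld
    print toupper("abc"), tolower("ABC"), length("abc")   # ABC abc 3
    print int(3.7), int(-3.7)           # 3 -3
    srand(1); a = rand()
    srand(1); b = rand()
    print (a == b)                      # 1: same seed, same sequence
}'
```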

   9. Input and output
       There are two output statements, print and printf.
              print  writes $0  ORS to standard output.
              print expr1, expr2, ..., exprn
                     writes  expr1 OFS expr2 OFS ... exprn ORS to
                     standard output.   Numeric  expressions  are
                     converted to string with OFMT.
              printf format, expr-list
                     duplicates  the  printf  C  library function
                     writing to standard  output.   The  complete
                     ANSI  C format specifications are recognized
                     with conversions %c, %d, %e, %E, %f, %g, %G,
                     %i,  %o,  %s, %u, %x, %X and %%, and conver-
                     sion qualifiers h and l.

       The argument list to print or  printf  can  optionally  be
       enclosed in parentheses.  Print formats numbers using OFMT
       or "%d" for exact integers.  "%c" with a numeric  argument
       prints  the  corresponding  8 bit character, with a string
       argument it prints the first character of the string.  The
       output  of print and printf can be redirected to a file or
       command by appending > file, >> file or | command  to  the
       end  of  the  print  statement.  Redirection opens file or
       command only once, subsequent redirections append  to  the
       already  open  stream.  By convention, mawk associates the
       filename "/dev/stderr" with stderr which allows print  and
       printf  to  be redirected to stderr.  mawk also associates
       "-" and "/dev/stdout" with stdin and stdout  which  allows
       these streams to be passed to functions.
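
       A sketch of printf conversions, redirection to stderr and
       the effect of OFS and ORS on print.  The values are
       arbitrary, and the generic awk command is assumed.

```shell
awk 'BEGIN {
    # %c prints character 65 for a number, the first character for a string.
    printf "%5.2f|%-4s|%d|%c%c\n", 3.14159, "ab", 42, 65, "xyz"
    print "a warning" > "/dev/stderr"   # goes to stderr, not stdout
    OFS = "-"; ORS = "!\n"
    print "a", "b"                      # a-b!
}'
```

       On stdout this writes " 3.14|ab  |42|Ax" followed by
       "a-b!".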

       The input function getline has the following variations.
              getline
                     reads  into  $0,  updates the fields, NF, NR
                     and FNR.
              getline < file
                     reads into $0 from file, updates the  fields
                     and NF.
              getline var
                     reads  the  next record into var, updates NR
                     and FNR.
              getline var < file
                     reads the next record of file into var.
              command | getline
                     pipes a record  from  command  into  $0  and
                     updates the fields and NF.
              command | getline var
                     pipes a record from command into var.

       Getline  returns  0 on end-of-file, -1 on error, otherwise
       1.

       Commands on the end of pipes are executed by /bin/sh.

       The function close(expr) closes the file or pipe associated
       with expr.  Close returns 0 if expr is an open file, the exit
       status if expr is a piped command, and -1 otherwise.  Close
       is used to reread a file or command, make sure the other end
       of an output pipe is finished or conserve file resources.
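
       A sketch of getline reading from a command, with close()
       used to rerun it.  The command string and the file name
       /nonexistent/file are hypothetical, and the generic awk
       command is assumed.

```shell
awk 'BEGIN {
    cmd = "echo one; echo two"          # hypothetical command, run by the shell
    n = 0
    while ((cmd | getline line) > 0)    # 1 per record, 0 at end-of-file
        n++
    close(cmd)                          # needed before the command can be rerun
    cmd | getline first                 # reads from the start again
    close(cmd)
    print n, line, first                # 2 two one
    if ((getline x < "/nonexistent/file") < 0)
        print "getline returned -1"
}'
```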

       The  function fflush(expr) flushes the output file or pipe
       associated with expr.  Fflush returns 0 if expr is an open
       output stream else -1.  Fflush without an argument flushes
       stdout.

       The function system(expr) uses /bin/sh to execute expr and
       returns the exit status of the command expr.  Changes made
       to the ENVIRON array are not passed to  commands  executed
       with system or pipes.

   10. User defined functions
       The syntax for a user defined function is

            function name( args ) { statements }

       The function body can contain a return statement

            return opt_expr

       A return statement is not required.  Function calls may be
       nested or recursive.  Functions are passed expressions  by
       value  and  arrays by reference.  Extra arguments serve as
       local variables and are initialized to null.  For example,
       csplit(s,A)  puts  each  character  of  s into array A and
       returns the length of s.

            function csplit(s, A,    n, i)
            {
              n = length(s)
              for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
              return n
            }

       Putting extra space between  passed  arguments  and  local
       variables  is  conventional.   Functions can be referenced
       before they are defined, but the function name and the '('
       of  the  arguments must touch to avoid confusion with con-
       catenation.
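
       The csplit() function can be exercised as follows.  The
       recursive function fact() and the variable names in the
       BEGIN block are hypothetical additions for the sketch, and
       the generic awk command is assumed.

```shell
awk '
function csplit(s, A,    n, i) {        # n and i are locals
    n = length(s)
    for (i = 1; i <= n; i++) A[i] = substr(s, i, 1)
    return n
}
function fact(k) { return k <= 1 ? 1 : k * fact(k - 1) }   # recursion
BEGIN {
    n = 99                              # global n, untouched by the local n
    len = csplit("abc", chars)          # chars is filled in by reference
    print len, chars[1], chars[3], n    # 3 a c 99
    print fact(5)                       # 120
}'
```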

   11. Splitting strings, records and files
       Awk programs use the same algorithm to split strings  into
       arrays  with split(), and records into fields on FS.  mawk
       uses essentially the same algorithm to  split  files  into
       records on RS.

       Split(expr,A,sep) works as follows:
              (1)    If  sep  is  omitted,  it is replaced by FS.
                     Sep can be an expression or regular  expres-
                     sion.   If it is an expression of non-string
                     type, it is converted to string.
              (2)    If sep = " " (a single space), then  <SPACE>
                     is  trimmed from the front and back of expr,
                     and  sep  becomes  <SPACE>.   mawk   defines
                     <SPACE>    as    the    regular   expression
                     /[ \t\n]+/.  Otherwise sep is treated  as  a
                     regular expression, except that meta-charac-
                     ters are ignored for a string of  length  1,
                     e.g., split(x, A, "*") and split(x, A, /\*/)
                     are the same.
              (3)    If expr is not string, it  is  converted  to
                     string.   If  expr  is then the empty string
                     "", split() returns 0 and A  is  set  empty.
                     Otherwise, all non-overlapping, non-null and
                     longest matches of  sep  in  expr,  separate
                     expr  into  fields  which are loaded into A.
                     The fields are placed in  A[1],  A[2],  ...,
                     A[n]  and  split()  returns n, the number of
                     fields which is the number of  matches  plus
                     one.  Data placed in A that looks numeric is
                     typed number and string.

       Splitting records into fields works the  same  except  the
       pieces  are  loaded into $1, $2,..., $NF.  If $0 is empty,
       NF is set to 0 and all $i to "".

       mawk splits files into records by the same algorithm,  but
       with  the slight difference that RS is really a terminator
       instead of a separator.  (ORS is really a terminator too).
              E.g.,  if  FS = ":+" and $0 = "a::b:" , then NF = 3
              and $1 = "a", $2 = "b" and $3 = "", but if  "a::b:"
              is  the  contents  of  an input file and RS = ":+",
              then there are two records "a" and "b".
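
       The splitting rules can be checked from the command line.
       A sketch with made-up input, using the generic awk
       command:

```shell
# Default FS (a single space): leading and trailing blanks are trimmed.
echo '  a   b ' | awk '{ print NF }'                  # 2

# FS as a regular expression: the trailing separator yields an empty $3.
echo 'a::b:' | awk -F':+' '{ print NF, $1, $2 }'      # 3 a b

# A separator of length 1 is taken literally, not as a metacharacter.
awk 'BEGIN { n = split("a*b", A, "*"); print n, A[2] }'   # 2 b

# An empty string gives no fields.
awk 'BEGIN { n = split("", A); print n }'                 # 0
```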

       RS = " " is not special.

       If FS = "", then mawk breaks the  record  into  individual
       characters, and, similarly, split(s,A,"") places the indi-
       vidual characters of s into A.

   12. Multi-line records
       Since mawk interprets RS as a regular  expression,  multi-
       line records are easy.  Setting RS = "\n\n+", makes one or
       more blank lines separate records.   If  FS  =  "  "  (the
       default),  then  single newlines, by the rules for <SPACE>
       above, become space and single newlines are field  separa-
       tors.
              For  example,  if  a  file  is  "a b\nc\n\n",  RS =
              "\n\n+" and FS = " ",  then  there  is  one  record
              "a b\nc"  with  three  fields  "a",  "b"  and  "c".
              Changing FS = "\n", gives two fields "a b" and "c";
              changing  FS = "", gives one field identical to the
              record.

       If you want lines with spaces or  tabs  to  be  considered
       blank,  set  RS = "\n([ \t]*\n)+".  For compatibility with
       other awks, setting RS = "" has  the  same  effect  as  if
       blank  lines are stripped from the front and back of files
       and then records are determined as if RS = "\n\n+".  Posix
       requires  that  "\n" always separates records when RS = ""
       regardless of the value of FS.  mawk does not support this
       convention,  because  defining  "\n"  as  <SPACE> makes it
       unnecessary.

       Most of  the  time  when  you  change  RS  for  multi-line
       records, you will also want to change ORS to "\n\n" so the
       record spacing is preserved on output.
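
       A sketch of blank-line separated records using RS = "",
       the form that other awks also accept.  The input is made
       up, and the generic awk command is assumed.

```shell
# Two records separated by blank lines; newlines inside a record
# act as field separators with the default FS.
printf 'a b\nc\n\n\nd e\n' | awk 'BEGIN { RS = "" } { print NR, NF }'
# 1 3
# 2 2
```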

   13. Program execution
       This section describes the  order  of  program  execution.
       First  ARGC  is  set  to  the total number of command line
       arguments passed to the execution phase  of  the  program.
       ARGV[0] is set to the name of the AWK interpreter and ARGV[1]
       ...  ARGV[ARGC-1] holds the remaining command  line  argu-
       ments  exclusive of options and program source.  For exam-
       ple with

            mawk  -f  prog  v=1  A  t=hello  B

       ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] =
       "A", ARGV[3] = "t=hello" and ARGV[4] = "B".

       Next,  each BEGIN block is executed in order.  If the pro-
       gram consists entirely of  BEGIN  blocks,  then  execution
       terminates,  else  an input stream is opened and execution
       continues.  If ARGC equals 1, the input stream is  set  to
       stdin,  else   the  command  line  arguments  ARGV[1]  ...
       ARGV[ARGC-1] are examined for a file argument.

       The command line arguments divide into  three  sets:  file
       arguments,  assignment arguments and empty strings "".  An
       assignment has the form var=string.  When  an  ARGV[i]  is
       examined as a possible file argument, if it is empty it is
       skipped; if it is an assignment argument,  the  assignment
       to  var takes place and i skips to the next argument; else
       ARGV[i] is opened for input.  If it fails to open,  execu-
       tion  terminates  with  exit  code  2.  If no command line
       argument is a file argument, then input comes from  stdin.
       A getline executed in a BEGIN action also opens input.  "-" as
       a file argument denotes stdin.

       Once an input stream is open, each input record is  tested
       against  each  pattern,  and if it matches, the associated
       action is executed.  An expression pattern matches  if  it
       is  boolean true (see the end of section 2).  A BEGIN pattern
       matches before any input has been read, and an END pattern
       matches after all input has been read.  A range pattern,
       expr1,expr2, matches every record from the record that matches
       expr1 through the record that matches expr2, inclusive.
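
       A small illustration of a range pattern, using made-up input;
       awk stands for the installed interpreter.

```shell
# Records from the first match of /start/ through the next match of
# /end/ are selected, both ends included.
out=$(printf 'a\nstart\nb\nend\nc\n' |
  awk '/start/,/end/ { print NR ": " $0 }')
printf '%s\n' "$out"
```

       Records 1 and 5 fall outside the range and are not printed.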

       When end of file occurs on the input stream, the remaining
       command line arguments are examined for a  file  argument,
       and  if there is one it is opened, else the END pattern is
       considered matched and all END actions are executed.

       In the example command line above, the assignment v=1 takes
       place after the BEGIN actions are executed, and the data
       placed in v is typed number and string.  Input is then read
       from file A.
       On end of file A, t is set to the string "hello", and B is
       opened for input.  On end of file B, the END  actions  are
       executed.
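
       The timing of the assignments can be observed with a sketch
       like the following.  The temporary directory, the one-line
       contents of A and B, and the use of mktemp are assumptions made
       for the demonstration; awk stands for the installed interpreter.

```shell
# Create two one-line input files named A and B in a scratch directory.
dir=$(mktemp -d)
printf 'a1\n' > "$dir/A"
printf 'b1\n' > "$dir/B"

# v is still unset in BEGIN; t is set only once A has been read.
out=$(cd "$dir" && awk '
  BEGIN { print "BEGIN: v=[" v "]" }
        { print FILENAME ": v=[" v "] t=[" t "]" }
  END   { print "END: t=[" t "]" }
' v=1 A t=hello B)
printf '%s\n' "$out"
rm -rf "$dir"
```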

       Program  flow at the pattern {action} level can be changed
       with the

            next
            exit  opt_expr

       statements.  A next statement causes the next input record
       to  be  read and pattern testing to restart with the first
       pattern {action} pair in the program.  An  exit  statement
       causes  immediate  execution of the END actions or program
       termination if there are none or if the exit occurs in  an
       END  action.  The opt_expr sets the exit value of the pro-
       gram unless overridden  by  a  later  exit  or  subsequent
       error.
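
       A short sketch of both statements at work; the input and the
       exit value 3 are arbitrary choices, and awk stands for the
       installed interpreter.

```shell
rc=0
out=$(printf '# comment\none\nstop\ntwo\n' | awk '
  /^#/         { next }      # read the next record, restart pattern testing
  $1 == "stop" { exit 3 }    # run the END actions, then exit with value 3
               { print }
  END          { print "records read: " NR }
') || rc=$?
printf '%s\nexit status: %s\n' "$out" "$rc"
```

       The record "two" is never read, because exit stops input
       processing before it is reached.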

EXAMPLES
       1. emulate cat.

            { print }

       2. emulate wc.

            { chars += length($0) + 1  # add one for the \n
              words += NF
            }

            END{ print NR, words, chars }

       3. count the number of unique "real words".

            BEGIN { FS = "[^A-Za-z]+" }

            { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

            END { delete word[""]
                  for ( i in word )  cnt++
                  print cnt
            }

       4. sum the second field of every record based on the first
       field.

            $1 ~ /credit|gain/ { sum += $2 }
            $1 ~ /debit|loss/  { sum -= $2 }

            END { print sum }

       5. sort a file, comparing as strings.

            { line[NR] = $0 "" }  # make sure of comparison type
                            # in case some lines look numeric

            END {  isort(line, NR)
              for(i = 1 ; i <= NR ; i++) print line[i]
            }

            #insertion sort of A[1..n]
            function isort( A, n,    i, j, hold)
            {
              for( i = 2 ; i <= n ; i++)
              {
                hold = A[j = i]
                while ( A[j-1] > hold )
                { j-- ; A[j+1] = A[j] }
                A[j] = hold
              }
              # sentinel A[0] = "" will be created if needed
            }
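
       To see why example 5 forces string comparison, it can be run on
       input that looks numeric.  This is a sketch with made-up input;
       the program is example 5 given inline, and awk stands for the
       installed interpreter.

```shell
# As strings, "10" sorts before "2" and "9"; as numbers it would sort last.
out=$(printf '9\n10\n2\n' | awk '
  { line[NR] = $0 "" }             # concatenation makes the value a string
  END { isort(line, NR)
        for (i = 1; i <= NR; i++) print line[i] }

  # insertion sort of A[1..n]; i, j and hold are local variables
  function isort(A, n,    i, j, hold) {
    for (i = 2; i <= n; i++) {
      hold = A[j = i]
      while (A[j-1] > hold) { j--; A[j+1] = A[j] }
      A[j] = hold
    }
  }')
printf '%s\n' "$out"
```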

COMPATIBILITY ISSUES
       The Posix 1003.2 (draft 11.3) definition of the AWK language
       is AWK as described in the AWK book with a few
       extensions that appeared in SystemVR4 nawk. The extensions
       are:
              New functions: toupper() and tolower().
              New variables: ENVIRON[] and CONVFMT.
              ANSI  C  conversion specifications for printf() and
              sprintf().
              New command options:   -v  var=value,  multiple  -f
              options  and implementation options as arguments to
              -W.

       Posix AWK is oriented to operate on  files  a  line  at  a
       time.  RS can be changed from "\n" to another single char-
       acter, but it is hard to find any use for  this  --  there
       are  no examples in the AWK book.  By convention, RS = ""
       makes one or more blank lines separate records, allowing
       multi-line records.  When RS = "", "\n" is always a field
       separator regardless of the value in FS.

       mawk, on the other hand, allows RS to be a regular expres-
       sion.   When  "\n"  appears  in  records, it is treated as
       space, and FS always determines fields.

       Removing the line at a time paradigm can  make  some  pro-
       grams  simpler  and  can  often  improve performance.  For
       example, redoing example 3 from above,

            BEGIN { RS = "[^A-Za-z]+" }

            { word[ $0 ] = "" }

            END { delete  word[ "" ]
              for( i in word )  cnt++
              print cnt
            }

       counts the number of unique words by making  each  word  a
       record.   On  moderate  size files, mawk executes twice as
       fast, because of the simplified inner loop.
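
       The program above can be tried on a small sample.  A regular
       expression in RS is an extension, so this sketch assumes that
       awk on the search path is mawk or another implementation that
       accepts one; the input text is made up.

```shell
# Every run of non-letters ends a record, so each record is one word.
out=$(printf 'the cat and the hat\nand the bat\n' | awk '
  BEGIN { RS = "[^A-Za-z]+" }
  { word[$0] = "" }
  END { delete word[""]
        for (w in word) cnt++
        print cnt }')
printf '%s\n' "$out"
```

       The five distinct words are the, cat, and, hat and bat.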

       The following program replaces each comment  by  a  single
       space in a C program file,

            BEGIN {
              RS = "/\*([^*]|\*+[^/*])*\*+/"
                 # comment is record separator
              ORS = " "
              getline  hold
              }
              { print hold ; hold = $0 }
              END { printf "%s" , hold }

       Buffering  one  record  is needed to avoid terminating the
       last record with a space.

       With mawk, the following are all equivalent,

            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"

       The strings get scanned twice, once as string and once  as
       regular  expression.  On the string scan, mawk ignores the
       escape on non-escape characters while the AWK  book  advo-
       cates  \c be recognized as c which necessitates the double
       escaping of meta-characters in strings.  Posix  explicitly
       declines  to  define  the  behavior which passively forces
       programs that must run under a variety of awks to use  the
       more portable but less readable, double escape.
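
       A brief check of the portable forms, sketched with awk standing
       for the installed interpreter.  The single escape inside a
       string is left out because its behavior is the undefined case
       described above.

```shell
# The regex literal and the double-escaped string both match a literal "+".
out=$(awk 'BEGIN {
  r1 = ("a+b" ~ /a\+b/)     # regular expression literal
  r2 = ("a+b" ~ "a\\+b")    # string: scanned to a\+b, then used as a regex
  r3 = ("aab" ~ "a\\+b")    # no literal "+" in "aab", so no match
  print r1, r2, r3
}')
printf '%s\n' "$out"
```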

       Posix AWK does not recognize "/dev/std{out,err}" or \x hex
       escape sequences in strings.  Unlike ANSI C,  mawk  limits
       the number of digits that follows \x to two as the current
       implementation only supports 8 bit characters.  The built-
       in  fflush  first  appeared  in  a  recent (1993) AT&T awk
       released to netlib, and is not part of the Posix standard.
       Aggregate deletion with delete array is not part of the
       Posix standard.

       Posix explicitly leaves the behavior of FS = "" undefined,
       and  mentions  splitting  the  record into characters as a
       possible interpretation, but currently  this  use  is  not
       portable across implementations.

       Finally,  here  is  how mawk handles exceptional cases not
       discussed in the AWK book  or  the  Posix  draft.   It  is
       unsafe  to assume consistency across awks and safe to skip
       to the next section.
              substr(s, i, n) returns the characters of s in  the
              intersection  of the closed interval [1, length(s)]
              and the half-open interval  [i,  i+n).   When  this
              intersection is empty, the empty string is
              returned; so substr("ABC", 1, 0) = "" and
              substr("ABC", -4, 6) = "A".
              Every  string,  including the empty string, matches
              the empty string at the front so, s ~ //  and  s  ~
              "",  are  always  1 as is match(s, //) and match(s,
              "").  The last two set RLENGTH to 0.
              index(s, t) is always  the  same  as  match(s,  t1)
              where  t1  is  the  same  as  t with metacharacters
              escaped.  Hence consistency with match requires
              that index(s, "") always returns 1.  Also the
              condition, index(s,t) != 0 if and only if t is a
              substring of s, requires index("","") = 1.
              If  getline  encounters  end  of file, getline var,
              leaves var unchanged.  Similarly, on entry  to  the
              END actions, $0, the fields and NF have their value
              unaltered from the last record.

SEE ALSO
       egrep(1) 

       Aho, Kernighan and Weinberger, The  AWK  Programming  Lan-
       guage,  Addison-Wesley  Publishing,  1988, (the AWK book),
       defines the language, opening with a tutorial and  advanc-
       ing to many interesting programs that delve into issues of
       software design and analysis relevant  to  programming  in
       any language.

       The  GAWK Manual, The Free Software Foundation, 1991, is a
       tutorial and language reference that does not attempt  the
       depth  of  the  AWK  book  and assumes the reader may be a
       novice programmer.  The section on AWK  arrays  is  excel-
       lent.  It also discusses Posix requirements for AWK.

BUGS
       mawk  cannot  handle  ASCII  NUL  \0 in the source or data
       files.  You can output NUL using printf with %c,  and  any
       other 8 bit character is acceptable input.

       mawk implements printf() and sprintf() using the C library
       functions, printf and sprintf, so full ANSI  compatibility
       requires  an ANSI C library.  In practice this means the h
       conversion qualifier may  not  be  available.   Also  mawk
       inherits any bugs or limitations of the library functions.

       Implementors of the AWK language have shown  a  consistent
       lack of imagination when naming their programs.

AUTHOR
       Mike Brennan brennan@boeing.com.
