libcodecs(3), take 4

To: tech-userlevel%netbsd.org@localhost
Subject: libcodecs(3), take 4
From: Alistair Crooks <agc%pkgsrc.org@localhost>
Date: Sat, 2 Oct 2010 05:11:59 +0200

        http://www.netbsd.org/~agc/codecs-20101001.tar.gz

I think I've addressed most of the issues that were brought up, and
soda-san said he was happy now the contentious charset transformations
have been removed.

In particular, the following changes have been made:

+ fixed memory management problems
+ then reactivated the "maybe constant" multiplier regexp
+ and activated the "free transformations on exit" functionality
+ fixup bin2hex transformation
+ add test for resolve and reverseresolve
+ manual page improvements
+ make codecs_valid_op() return a bool
+ document codecs_valid_op()
+ redid the way installed codecs are listed
+ add an unhexdump() transformation
+ add an example leet-speak conversion transformation
+ add a HOWTO

I've done some more development on libcodecs(3), and have attached
leet.c, an example program which shows how to add a transformation
to the "leet" character set.  The mapping is based on the Wikipedia
charset, so it's probably not what you'd expect.  There's also a HOWTO
attached to this mail which should explain things a bit more.  If it
doesn't, please don't hesitate to complain.

I'm aiming to add this to the repo at the start of next week.

Regards,
Alistair

PS.  To answer Yamamoto-san's point that he couldn't see any use
cases, there are lots in the standard transformations in libcodecs(3). 
At the same time, there is a case for getting rid of the following
programs from base:

asa
uuencode
uudecode
perhaps vis (with some more work)
perhaps the digest programs
all the od functionality that I use (with hexdump) and more (via unhexdump)

as well as adding base64 and base85 encoding/decoding for free, and
having command line access to decent randomisation and zero'd areas
without resorting to using dd.

1. libcodecs(3)

libcodecs(3) is a library which provides a single framework and
interface for functions which carry out a transformation on input data
to produce data as output.

Standard transformations are provided, ranging from binary to hex and
hexdump and undump functions, to gzip, bzip2 compression, network
address resolution and reverse resolution, hash functions, message
digests, and many more.

This document shows how to add a new transformation, and how to use
the transformation in code.


2. Writing a Transformation

The signature for a transformation is as follows:

int transform(const char *in, const size_t insize, const char *op, void *vp,
        size_t outsize);

Bounded input data is provided as (in, insize), the operation to
perform is given in "op", and the transformation will be made on the
input data to give output in "vp", which has room for "outsize" bytes.
The number of characters in the output is returned from the function.
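
As a rough illustration of that signature, here is a minimal sketch of
a transformation which just reverses its input.  It is not part of the
tarball: the name reverse_bytes() is made up, and returning -1 when
the output area is too small is an assumption, since this HOWTO does
not say how a transformation should report errors.

```c
#include <stddef.h>

/*
 * Hypothetical transformation: reverse the input bytes.
 * It uses the same parameter list as the signature above; "op" is
 * not needed here.  Returns the number of bytes written, or -1 if
 * "vp" is too small (an assumed error convention, not libcodecs').
 */
int
reverse_bytes(const char *in, const size_t insize, const char *op, void *vp,
        size_t outsize)
{
        char    *out = vp;
        size_t   i;

        (void) op;
        /* need room for every input byte plus a terminating NUL */
        if (outsize == 0 || insize > outsize - 1) {
                return -1;
        }
        for (i = 0 ; i < insize ; i++) {
                out[i] = in[insize - 1 - i];
        }
        out[insize] = 0x0;
        return (int)insize;
}
```

Called directly with "abc" and an 8-byte output area, this writes
"cba" and returns 3; with 8 bytes of input and the same output area
it returns -1 and writes nothing.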

For an example of such a function, please see the toleet() function
in leet.c:

        /* convert alphabetic chars to the leet char set -- see above */
        int
        toleet(const char *in, const size_t insize, const char *op, void *vp,
                size_t outsize)
        {
                const char      *cp;
                size_t           len;
                size_t           i;
                size_t           o;
                char            *out = (char *)vp;

                /* no room even for the terminating NUL */
                if (outsize == 0) {
                        return 0;
                }
                for (i = 0, o = 0 ; i < insize && o < outsize - 1 ; i++) {
                        if (isalpha((uint8_t)in[i])) {
                                cp = leet[tolower((uint8_t)in[i]) - 'a'];
                                len = strlen(cp);
                                /* stop before a replacement overruns out */
                                if (len > outsize - 1 - o) {
                                        break;
                                }
                                (void) memcpy(&out[o], cp, len);
                                o += len;
                        } else {
                                out[o++] = in[i];
                        }
                }
                out[o] = 0x0;
                return (int)o;
        }


3. Adding the Transformation

The libcodecs(3) library can be instantiated many times, by using
separate codecs_t tables to hold the transformations which can be
made.

So adding a transformation to a table is as simple as initialising the
storage for the table, and adding the desired transformation
function(s).

        codecs_t         codecs;

        (void) memset(&codecs, 0x0, sizeof(codecs));
        codecs_add(&codecs, "leet", toleet, "500%", 1);

The codecs_add() function is used to make the transformation available.

The first argument is the table of transformations. Multiple tables
can be used.

The second argument is a regular expression which is used to match the
transformation (this is useful for cases where more than one
transformation function is available).

The third argument is the transformation function itself. This function
will get called when the transformation framework matches the regular
expression.

The fourth argument is a multiplier string, used to size the
dynamically allocated storage in the codecs_alloc_transform() function.
It is in the format of "percentage + constant", where the percentage
is the worst-case size of the output relative to the amount of input
data, and the constant is an additional number of bytes.  The "500%"
above allows for output up to five times the size of the input.

The fifth and final argument gives an indication whether input is
needed to the transformation function.  Some transformations just fill
in output without needing any input to transform, such as randomize,
or zero, which produce random data, and zeroed out data, respectively.
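
To make the role of the second argument a bit more concrete, here is a
sketch of how a table keyed by regular expressions could be searched.
It is not the libcodecs implementation: entry_t, table_t, table_add(),
table_find() and MAX_ENTRIES are all made-up names, and the real
library may well compile, anchor or match its patterns differently.
The only real API used is POSIX regcomp(3)/regexec(3).

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*xform_fn)(const char *, const size_t, const char *, void *,
        size_t);

/* one table entry: a hypothetical layout, not the real codecs_t */
typedef struct {
        regex_t          re;            /* compiled form of the pattern */
        xform_fn         func;          /* function to call on a match */
        const char      *multiplier;    /* e.g. "500%" */
        bool             input_needed;
} entry_t;

#define MAX_ENTRIES     16

typedef struct {
        entry_t          entries[MAX_ENTRIES];
        size_t           count;
} table_t;

/* add an entry to the table; returns 1 on success, 0 on failure */
int
table_add(table_t *t, const char *pattern, xform_fn func,
        const char *multiplier, const bool input_needed)
{
        entry_t *e;

        if (t->count == MAX_ENTRIES) {
                return 0;
        }
        e = &t->entries[t->count];
        /* extended regexps; only a yes/no answer is needed later */
        if (regcomp(&e->re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
                return 0;
        }
        e->func = func;
        e->multiplier = multiplier;
        e->input_needed = input_needed;
        t->count += 1;
        return 1;
}

/* return the first entry whose pattern matches "op", or NULL */
const entry_t *
table_find(const table_t *t, const char *op)
{
        size_t  i;

        for (i = 0 ; i < t->count ; i++) {
                if (regexec(&t->entries[i].re, op, 0, NULL, 0) == 0) {
                        return &t->entries[i];
                }
        }
        return NULL;
}
```

With a zeroed table, adding an illustrative pattern such as
"^rot[0-9]+$" lets one function serve "rot13", "rot5" and so on, and
table_find() returns NULL for an operation like "leet" which nothing
matches.  Note that an unanchored pattern matches anywhere in the
operation name, so the sketch anchors its patterns explicitly.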

4. Making the Transformation

To make the transformation, we need to use the codecs table to match
up the correct transformation, and give it the data.  The simplest way
to do this is to transform the data in-place.

        cc = codecs_inplace_transform(&codecs, buf, strlen(buf), "leet");

This is not always possible, since sometimes the input needs to be
preserved. If this is the case, then the storage for the output can
be allocated dynamically.

        cc = codecs_alloc_transform(&codecs, buf, strlen(buf),
                        "leet", (void **)(void *)&out, &outsize);

(sorry about the ugly casts, please blame^U)
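
For the curious, here is a sketch of how the multiplier string and an
in-place call could fit together, going by the manual page's
description (temporary storage is allocated, then the result is copied
over the input).  None of this is the library's code: worst_case(),
inplace_sketch() and double_bytes() are made-up names, the
"N%" / "N%+M" grammar is only my reading of "percentage + constant",
and the explicit capacity argument is an assumption, since what the
real function does when the output is larger than the input isn't
covered above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int (*xform_fn)(const char *, const size_t, const char *, void *,
        size_t);

/* worst-case output size for a multiplier such as "500%" or "133%+4" */
size_t
worst_case(const char *multiplier, size_t insize)
{
        size_t  pct = 100;      /* default if nothing can be parsed */
        size_t  extra = 0;

        (void) sscanf(multiplier, "%zu%%+%zu", &pct, &extra);
        /* round the percentage up; overflow is ignored in this sketch */
        return (insize * pct + 99) / 100 + extra;
}

/* hypothetical transformation: write every input byte twice */
int
double_bytes(const char *in, const size_t insize, const char *op, void *vp,
        size_t outsize)
{
        char    *out = vp;
        size_t   i;
        size_t   o;

        (void) op;
        if (outsize == 0 || insize > (outsize - 1) / 2) {
                return -1;
        }
        for (i = 0, o = 0 ; i < insize ; i++) {
                out[o++] = in[i];
                out[o++] = in[i];
        }
        out[o] = 0x0;
        return (int)o;
}

/* transform buf[0..len) via temporary storage, then copy back */
int
inplace_sketch(xform_fn func, const char *multiplier, const char *op,
        char *buf, size_t len, size_t cap)
{
        size_t   tmpsize;
        char    *tmp;
        int      cc;

        tmpsize = worst_case(multiplier, len) + 1;      /* +1 for NUL */
        if ((tmp = calloc(1, tmpsize)) == NULL) {
                return -1;
        }
        cc = func(buf, len, op, tmp, tmpsize);
        /* refuse to copy back more than the caller's buffer can hold */
        if (cc < 0 || (size_t)cc >= cap) {
                free(tmp);
                return -1;
        }
        (void) memcpy(buf, tmp, (size_t)cc);
        buf[cc] = 0x0;
        free(tmp);
        return cc;
}
```

With a 16-byte buffer holding "abc", inplace_sketch(double_bytes,
"200%", "double", buf, 3, sizeof(buf)) leaves "aabbcc" in the buffer
and returns 6.  With a 4-byte buffer it returns -1 and leaves the
input alone, which is the case the capacity argument is there for.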

Alistair Crooks
Fri Oct  1 07:01:26 PDT 2010

/*-
 * Copyright (c) 2010 Alistair Crooks <agc%NetBSD.org@localhost>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
#include <codecs.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

static const char *leet[] = {
        "4",
        "8",
        "(",
        "|)",
        "3",
        "|=",
        "6",
        "|-|",
        "!",
        "_|",
        "X",
        "1",
        "/\\/\\",
        "|\\|",
        "0",
        "|*",
        "0_",
        "|2",
        "5",
        "7",
        "|_|",
        "\\/",
        "\\/\\/",
        "%",
        "j",
        "2"
};

/* convert alphabetic chars to the leet char set -- see above */
int
toleet(const char *in, const size_t insize, const char *op, void *vp,
        size_t outsize)
{
        const char      *cp;
        size_t           len;
        size_t           i;
        size_t           o;
        char            *out = (char *)vp;

        /* no room even for the terminating NUL */
        if (outsize == 0) {
                return 0;
        }
        for (i = 0, o = 0 ; i < insize && o < outsize - 1 ; i++) {
                if (isalpha((uint8_t)in[i])) {
                        cp = leet[tolower((uint8_t)in[i]) - 'a'];
                        len = strlen(cp);
                        /* stop before a replacement overruns out */
                        if (len > outsize - 1 - o) {
                                break;
                        }
                        (void) memcpy(&out[o], cp, len);
                        o += len;
                } else {
                        out[o++] = in[i];
                }
        }
        out[o] = 0x0;
        return (int)o;
}

int
main(int argc, char **argv)
{
        codecs_t         codecs;
        char             buf[BUFSIZ];
        int              cc;

        (void) memset(&codecs, 0x0, sizeof(codecs));
        codecs_add(&codecs, "leet", toleet, "500%", 1);
        for (;;) {
                (void) fprintf(stderr, "Leet> ");
                /* thanks, yes, i know, this is superfluous */
                (void) fflush(stderr);
                if (fgets(buf, sizeof(buf), stdin) == NULL) {
                        break;
                }
                cc = codecs_inplace_transform(&codecs, buf, strlen(buf),
                        "leet");
                if (cc <= 0) {
                        break;
                }
                printf("%s", buf);
        }
        exit(EXIT_SUCCESS);
}

LIBCODECS(3)            NetBSD Library Functions Manual           LIBCODECS(3)

NAME
     libcodecs -- string coding and decoding functions for transforming data

LIBRARY
     library ``libcodecs''

SYNOPSIS
     #include <codecs.h>

     int
     codecs_transform(codecs_t *codecs, const char *in, const size_t insize,
         const char *operation, void *out, size_t outsize);

     int
     codecs_alloc_transform(codecs_t *codecs, const char *in,
         const size_t insize, const char *operation, void **outp,
         size_t *outsize);

     int
     codecs_inplace_transform(codecs_t *codecs, void *input, int size,
         const char *operation);

     int
     codecs_size(codecs_t *codecs, const char *operation,
         const size_t insize);

     int
     codecs_valid_op(codecs_t *codecs, const char *op);

     bool
     codecs_input_needed(codecs_t *codecs, const char *operation);

     int
     codecs_begin(codecs_t *codecs, const char *subset, ...);

     int
     codecs_lockdown(codecs_t *codecs);

     int
     codecs_add(codecs_t *codecs, const char *operation,
         int (*)(const char *, const size_t, const char *, void *, size_t),
         const char *multiplier, const bool input_needed);

DESCRIPTION
     libcodecs is a library interface which implements various transformations
     from input data to output data.  Text is transformed by the libcodecs
     library, converting the input to the output format.  New transformations
     can be added to the table.  The table can also be locked to prevent
     further transformations being added.  A lot of these transformations are
     available at the system level already.  However, libcodecs provides a
     single, consistent interface to the transformations, in a way that is
     easy to provide as an interface for scripting languages and from the
     shell.

     The basic way of using the libcodecs library is to call the
     codecs_transform() function to transform the text.  Two alternative
     functions are provided.  The codecs_alloc_transform() function will
     dynamically allocate the space for the output array using calloc(3).
     In-place transformations can be made using the codecs_inplace_transform()
     function.  An ``in-place'' transformation means that the transformation
     will be done using temporary storage which is allocated, and then the
     transformed text will be copied over the original input, thereby making
     the operation appear to have transformed the text in situ.

     The transformation table holding information on all the possible
     transformations can be initialised using the codecs_begin() function.
     The function can be used to limit the transformations which get loaded
     into the transformation table.  At the present time, the following
     subsets of transformations are defined:

     all      will load all the following subsets of transformations

     charset  will load all the transformations relating to character sets,
              including base64 and base85, EBCDIC, RAD50, etc.

     digest   will load all the transformations relating to message digests,
              including md5, sha1, etc

     fill     will load all the transformations relating to region fill,
              including zero and randomise

     format   will load all the transformations relating to formatting of out-
              put, such as hexadecimal dumping, rotation, etc

     edit     will load all the transformations relating to editing of output,
              such as sed and edit functionality

     hash     will load all the transformations relating to 32bit hashing.

     network  will load all the transformations relating to network name reso-
              lution

     It is not necessary to call this function prior to using any of the
     functionality in the libcodecs library -- if the table has not been
     initialised by the time of the first call, then it will be called
     automatically.

     The internal transformation information carries information on the
     worst-case size of the output array.  This size can be calculated using
     the codecs_size() function, passing into the function the size of the
     input buffer.  The codecs_input_needed() function will return an
     indication whether an input buffer is needed.  Please note that an input
     buffer is needed for the codecs_inplace_transform() transformation call.
     The codecs_valid_op() function is used to verify that the current
     operation is a known transformation.

     The idea behind the libcodecs library is that individual transformations
     are defined by a C function with a pre-set calling signature.  This can
     be a wrapper around existing functionality, like the digest or strvis(3)
     transformations, or user provided.  This transformation is added to the
     table of transformations using the codecs_add() function.  Some
     pre-defined transformations are provided, as explained below.  The caller
     can then invoke the transformation in one of three ways:

     codecs_transform
              by providing input data, and an area for the output of the
              transformation to be placed.

     codecs_alloc_transform
              by providing input data, the area containing the output will be
              dynamically allocated using calloc(3)

     codecs_inplace_transform
              in which the transformation will be made, and the output data
              will be copied in place over the input data.

     There are a number of pre-defined transformations provided:

     asa          [format] perform Fortran control character transformations
                  in the form of the POSIX asa(1) command.

     base64decode
                  [charset] perform atob, or base64, decoding.  Each sequence
                  of 4 bytes is transformed back into a 3 byte sequence.

     base64encode
                  [charset] perform atob, or base64, encoding.  Each sequence
                  of 3 bytes is transformed into a 4 byte sequence from the
                  pre-defined 64-byte set.

     base85decode
                  [charset] perform base85 decoding.  Each sequence of 5 bytes
                  is transformed back into a 4 byte sequence.

     base85encode
                  [charset] perform base85 encoding.  Each sequence of 4 bytes
                  is transformed into a 5 byte sequence from the pre-defined
                  85-byte set.

     bin2hex      [charset] encodes the input string as 4-character C-string
                  style hexadecimal constants.

     bswap16      [format] perform a bytewise swap of the 16-bit entity

     bswap32      [format] perform a bytewise swap of the 32-bit entity

     bswap64      [format] perform a bytewise swap of the 64-bit entity

     dos2unix     [format] DOS style line-endings are transformed into Unix
                  style line-endings.

     edit         [edit] edit the input text with the ``EDITOR'' or ``VISUAL''
                  editor, as defined in the environment.

     from-uri     [charset] convert from a percent-encoded URI to ASCII text.

     full-uuencode
                  [charset] convert the given text into uuencoded text (see
                  also the uuencode and uudecode transforms), adding a file
                  header and trailer.

     gethostinfo  [network] attempt to reverse resolve the hostname, given the
                  IP address (either IPv4 or IPv6) as input.

     getipaddress
                  [network] attempt to resolve the IP address (both IPv4 and
                  IPv6) given the hostname as input.

     gunzip       [compress] decompress the input buffer using zlib(3)

     gzip         [compress] compress the input buffer using zlib(3)

     hex2bin      [charset] decodes the input string from 4-character C-string
                  style hexadecimal constants to binary output.

     hexdump      [format] converts the input text to an ASCII-clean hexadeci-
                  mal dump format, including a printable representation of the
                  input text.

     list         [fill] lists the available codecs in the current instance.

     md5          [digest] calculate the MD5 digest using MD5Data(3)

     metaphone    [charset] calculate the metaphone phonetic value for the
                  input.

     rad50decode  [charset] converts the input text from DEC RADIX-50 format
                  to the original text. Due to the limited range of the
                  RADIX-50 character set, some of the original text may have
                  been lost.

     rad50encode  [charset] converts the input text to DEC RADIX-50 format
                  from the original text. Due to the limited range of the
                  RADIX-50 character set, some of the original text may be
                  lost.

     randomise    [fill] fill the output with random values.

     rmd160       [digest] calculate the RMD160 digest using RMD160Data(3)

     rot          [format] transform the input text with a circular rotation.
                  The most famous of these is the Caesar rot13(6) transforma-
                  tion, but this transformation allows any length of rotation
                  to be used.

     secs2str     [format] transforms the input value (as the ASCII-encoded
                  decimal value of seconds since the start of the epoch) to a
                  colon-separated value representing the date.

     sed          [edit] performs a sed(1) transformation on a regular expres-
                  sion. Please note that full, extended regular expressions,
                  as defined in re_format(7) are used to match.

     size         [digest] returns the size of the input as a decimal string

     sha1         [digest] calculate the SHA1 digest using SHA1Data(3)

     sha256       [digest] calculate the SHA256 digest using SHA256_Data(3)

     sha512       [digest] calculate the SHA512 digest using SHA512_Data(3)

     soundex      [charset] calculate the soundex phonetic value for the
                  input.

     str2secs     [format] transforms the input value (as the colon-separated
                  value representing the date) to an ASCII-encoded decimal
                  value representing seconds since the start of the epoch.

     strunvis     [charset] uses the unstrvis(3) transformation on the input
                  data.

     strvis       [charset] uses the strvis(3) transformation on the input
                  data.

     strvisc      [charset] uses the strvisc(3) transformation on the input
                  data.

     substring    [edit] extract a substring of the input string, and place it
                  in the output string.

     to-uri       [charset] convert from ASCII text to a percent-encoded URI.

     to-lower     [charset] change any uppercase letters in the input string
                  to lowercase.

     to-upper     [charset] change any lowercase letters in the input string
                  to uppercase.

     unhexdump    [format] converts the input text from the ASCII-clean hexa-
                  decimal dump format, created by the hexdump transformation,
                  back to its original binary form.

     unix2dos     [charset] the Unix-style line-endings are converted to DOS
                  style line-endings.

     uudecode     [charset] transform the input text from uuencode(1) text to
                  the original text.

     uuencode     [charset] encode the input text as uuencode(1) text.

     zero         [fill] produce an area containing NUL bytes in the output.

     A number of hash functions have also been implemented, namely:

     dumbhash       [hash] implements a simple hashing scheme based on the
                    addition of the value of each character in the string.

     dumbmulhash    [hash] implements a simple hashing scheme based on the
                    addition of the value of each character in the string mul-
                    tiplied by its position in the string.

     lennart        [hash] implements a simple and fast generic string hasher
                    based on Peter K. Pearson's article in CACM 33-6, pp. 677.

     crchash        [hash] implements a hash used in CRC calculations

     perlhash       [hash] implements the addition-based hash algorithm used
                    internally in the perl interpreter.

     perlxorhash    [hash] implements the XOR-based hash algorithm used inter-
                    nally in the perl interpreter.

     pythonhash     [hash] implements the hash algorithm used internally in
                    the python interpreter.

     mousehash      [hash] implements an XOR-based hash algorithm from der
                    Mouse.

     bernstein      [hash] implements a multiplicative-based hash algorithm
                    from Daniel Bernstein.

     honeyman       [hash] implements an XOR-based hash algorithm from Peter
                    Honeyman.

     pjwhash        [hash] implements the so called `hashpjw' function by P.J.
                    Weinberger from Aho/Sethi/Ullman, COMPILERS: Principles,
                    Techniques and Tools, 1986, 1987 Bell Telephone Laborato-
                    ries, Inc.

     bobhash        [hash] implements another, more complex hash algorithm.

     torekhash      [hash] implements a hash algorithm due to Chris Torek, and
                    using Duff's device.

     byacchash      [hash] implements the hash function found in Berkeley
                    byacc(1) program

     tclhash        [hash] implements the hash algorithm used internally in
                    the tcl interpreter.

     gawkhash       [hash] implements the hash algorithm used internally in
                    the gawk interpreter, also using Duff's device.

     gcc3_hash      [hash] implements one of the hash algorithms found in gcc3

     gcc3_hash2     [hash] implements another of the hash algorithms found in
                    gcc3

     nemhash        [hash] implements another hash function

RETURN VALUES
     On a successful transformation, the codecs_transform(),
     codecs_alloc_transform() and codecs_inplace_transform() functions return
     the actual number of bytes in the output transformation.  On a successful
     initialisation, codecs_begin() will return a value of 1.  The
     codecs_size() function returns the number of bytes which will be needed
     to contain the given transformation with the given size of input bytes.

SEE ALSO
     asa(1), sed(1), uudecode(1), uuencode(1), calloc(3), MD5Data(3),
     RMD160Data(3), SHA1Data(3), SHA256_Data(3), SHA512_Data(3), strvis(3),
     strvisc(3), unstrvis(3), zlib(3), rot13(6), re_format(7)

HISTORY
     The libcodecs library first appeared in NetBSD 6.0.

AUTHORS
     Alistair Crooks <agc%NetBSD.org@localhost>

NetBSD 5.0                    September 30, 2010                    NetBSD 5.0
