tech-userlevel archive
libcodecs(3), take 4
http://www.netbsd.org/~agc/codecs-20101001.tar.gz
I think I've addressed most of the issues that were brought up, and
soda-san said he was happy now the contentious charset transformations
have been removed.
In particular, the following changes have been made:
+ fixed memory management problems
+ then reactivated the "maybe constant" multiplier regexp
+ and activate the "free transformations on exit" functionality
+ fixup bin2hex transformation
+ add test for resolve and reverseresolve
+ manual page improvements
+ make codecs_valid_op() return a bool
+ document codecs_valid_op()
+ redid the way installed codecs are listed
+ add an unhexdump() transformation
+ add an example leet-speak conversion transformation
+ add a HOWTO
I've done some more development on libcodecs(3), and have attached
leet.c, a use of libcodecs(3) which shows how to add a transformation
to the "leet" character set. The mapping is based on the Wikipedia
leet charset, so it's probably not what you'd expect. There's also a
HOWTO attached to this mail
which should explain things a bit more. If it doesn't, please don't
hesitate to complain.
I'm aiming to add this to the repo at the start of next week.
Regards,
Alistair
PS. To answer Yamamoto-san's point that he couldn't see any use
cases, there are lots in the standard transformations in libcodecs(3).
At the same time, there is a case for getting rid of the following
programs from base:
asa
uuencode
uudecode
perhaps vis (with some more work)
perhaps the digest programs
all the od functionality that I use (with hexdump) and more (via unhexdump)
as well as adding base64 and base85 encoding/decoding for free, and
having command line access to decent randomisation and zero'd areas
without resorting to using dd.
1. libcodecs(3)
libcodecs(3) is a library which provides a single framework and
interface for functions which carry out a transformation on input data
to produce data as output.
Standard transformations are provided, ranging from binary to hex and
hexdump and undump functions, to gzip, bzip2 compression, network
address resolution and reverse resolution, hash functions, message
digests, and many more.
This document shows how to add a new transformation, and how to use
the transformation in code.
2. Writing a Transformation
The signature for a transformation is as follows:
int transform(const char *in, const size_t insize, const char *op, void *vp,
size_t outsize);
Bounded input data is provided as (in, insize), and the operation to
perform is given in "op". The transformation is made on the input
data, and the output is written to "vp", which has room for "outsize"
bytes. The number of characters in the output is returned from the
function.
For an example of such a function, please see the toleet() function
in leet.c:
/* convert alphabetic chars to the leet char set -- see above */
int
toleet(const char *in, const size_t insize, const char *op, void *vp,
	size_t outsize)
{
	const char	*cp;
	size_t		 len;
	size_t		 i;
	size_t		 o;
	char		*out = (char *)vp;
	int		 ch;

	if (outsize == 0) {
		return 0;
	}
	for (i = 0, o = 0 ; i < insize && o < outsize - 1 ; i++) {
		ch = tolower((uint8_t)in[i]);
		/* only a-z index the table; locale letters pass through */
		if (ch >= 'a' && ch <= 'z') {
			cp = leet[ch - 'a'];
			len = strlen(cp);
			if (o + len > outsize - 1) {
				/* no room for the whole replacement */
				break;
			}
			(void) memcpy(&out[o], cp, len);
			o += len;
		} else {
			out[o++] = in[i];
		}
	}
	out[o] = 0x0;
	return (int)o;
}
3. Adding the Transformation
The libcodecs(3) library can be instantiated many times, by using
separate codecs_t tables to hold the transformations which can be
made.
So adding a transformation to a table is as simple as initialising the
storage for the table, and adding the desired transformation
function(s).
	codecs_t	codecs;

	(void) memset(&codecs, 0x0, sizeof(codecs));
	codecs_add(&codecs, "leet", toleet, "500%", 1);
The codecs_add() function is used to make the transformation available.
The first argument is the table of transformations. Multiple tables
can be used.
The second argument is a regular expression which is used to match the
transformation (this is useful for cases where more than one
transformation function is available).
The third argument is the transformation function itself. This function
will get called when the transformation framework matches the regular
expression.
The fourth argument is used to size the dynamically allocated storage
in the codecs_alloc_transform() function. It is in the format
"percentage + constant", where the percentage is the worst-case output
size as a multiple of the input size, and the constant is an
additional number of bytes.
The fifth and final argument indicates whether the transformation
function needs any input. Some transformations just fill in output
without needing any input to transform, such as randomise and zero,
which produce random data and zeroed-out data, respectively.
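As a rough illustration of the fourth argument, here is a minimal
sketch of how a worst-case output size could be derived from a
multiplier string. The helper name worst_case_size() and the exact
"NNN%+CCC" syntax are assumptions made for this example only; the
parser inside libcodecs(3) may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of libcodecs(3): work out a worst-case
 * output size from a multiplier such as "500%" or "100%+16".
 * Returns 0 if the string is not in the assumed format.
 */
size_t
worst_case_size(const char *multiplier, size_t insize)
{
	unsigned long	 percent;
	unsigned long	 constant = 0;
	char		*end;

	assert(multiplier != NULL);
	percent = strtoul(multiplier, &end, 10);
	if (end == multiplier || *end != '%') {
		return 0;
	}
	end++;
	if (*end == '+') {
		constant = strtoul(end + 1, NULL, 10);
	}
	/* round the percentage part up so we never under-allocate */
	return ((insize * percent) + 99) / 100 + constant;
}
```

With "500%", as used for the leet transformation, 10 bytes of input
would reserve 50 bytes of output under this reading.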
4. Making the Transformation
To make the transformation, we need to use the codecs table to match
up the correct transformation, and give it the data. The simplest way
to do this is to transform the data in-place.
	cc = codecs_inplace_transform(&codecs, buf, strlen(buf), "leet");
This is not always possible, since sometimes the input needs to be
preserved. If this is the case, then the storage for the output can
be allocated dynamically.
	cc = codecs_alloc_transform(&codecs, buf, strlen(buf),
		"leet", (void **)(void *)&out, &outsize);
(sorry about the ugly casts, please blame^U)
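For what it's worth, the casts can be avoided on the caller's side by
receiving the result in a plain "void *" first. The sketch below is
self-contained and does not use libcodecs(3) at all:
alloc_transform_sketch() and copy_sketch() are hypothetical stand-ins,
written only to show the calling pattern and the caller's duty to
free(3) the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*transform_fn)(const char *, const size_t, const char *,
	void *, size_t);

/* stand-in transformation: a plain, NUL-terminated copy */
int
copy_sketch(const char *in, const size_t insize, const char *op, void *vp,
	size_t outsize)
{
	size_t	n;

	(void) op;
	if (outsize == 0) {
		return 0;
	}
	n = (insize < outsize - 1) ? insize : outsize - 1;
	(void) memcpy(vp, in, n);
	((char *)vp)[n] = 0x0;
	return (int)n;
}

/*
 * Hypothetical stand-in for an allocating call: the output is
 * calloc'ed here, and the caller must free(*outp) afterwards.
 */
int
alloc_transform_sketch(transform_fn fn, const char *in, size_t insize,
	const char *op, void **outp, size_t *outsize)
{
	assert(fn != NULL && outp != NULL && outsize != NULL);
	*outsize = insize + 1;	/* assumed worst case for a plain copy */
	if ((*outp = calloc(1, *outsize)) == NULL) {
		return -1;
	}
	return (*fn)(in, insize, op, *outp, *outsize);
}
```

A caller would declare "void *p;", pass "&p", and then assign
"char *out = p;" once the call has succeeded, so no cast is needed.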
Alistair Crooks
Fri Oct 1 07:01:26 PDT 2010
/*-
* Copyright (c) 2010 Alistair Crooks <agc%NetBSD.org@localhost>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include <codecs.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
static const char *leet[] = {
	"4",
	"8",
	"(",
	"|)",
	"3",
	"|=",
	"6",
	"|-|",
	"!",
	"_|",
	"X",
	"1",
	"/\\/\\",
	"|\\|",
	"0",
	"|*",
	"0_",
	"|2",
	"5",
	"7",
	"|_|",
	"\\/",
	"\\/\\/",
	"%",
	"j",
	"2"
};
/* convert alphabetic chars to the leet char set -- see above */
int
toleet(const char *in, const size_t insize, const char *op, void *vp,
	size_t outsize)
{
	const char	*cp;
	size_t		 len;
	size_t		 i;
	size_t		 o;
	char		*out = (char *)vp;
	int		 ch;

	if (outsize == 0) {
		return 0;
	}
	for (i = 0, o = 0 ; i < insize && o < outsize - 1 ; i++) {
		ch = tolower((uint8_t)in[i]);
		/* only a-z index the table; locale letters pass through */
		if (ch >= 'a' && ch <= 'z') {
			cp = leet[ch - 'a'];
			len = strlen(cp);
			if (o + len > outsize - 1) {
				/* no room for the whole replacement */
				break;
			}
			(void) memcpy(&out[o], cp, len);
			o += len;
		} else {
			out[o++] = in[i];
		}
	}
	out[o] = 0x0;
	return (int)o;
}
int
main(int argc, char **argv)
{
	codecs_t	codecs;
	char		buf[BUFSIZ];
	int		cc;

	(void) memset(&codecs, 0x0, sizeof(codecs));
	codecs_add(&codecs, "leet", toleet, "500%", 1);
	for (;;) {
		(void) fprintf(stderr, "Leet> ");
		/* thanks, yes, i know, this is superfluous */
		(void) fflush(stderr);
		if (fgets(buf, sizeof(buf), stdin) == NULL) {
			break;
		}
		cc = codecs_inplace_transform(&codecs, buf, strlen(buf),
			"leet");
		if (cc <= 0) {
			break;
		}
		printf("%s", buf);
	}
	exit(EXIT_SUCCESS);
}
LIBCODECS(3) NetBSD Library Functions Manual LIBCODECS(3)
NAME
libcodecs -- string coding and decoding functions for transforming data
LIBRARY
library ``libcodecs''
SYNOPSIS
#include <codecs.h>
int
codecs_transform(codecs_t *codecs, const char *in, const size_t insize,
    const char *operation, void *out, size_t outsize);
int
codecs_alloc_transform(codecs_t *codecs, const char *in,
    const size_t insize, const char *operation, void **outp,
    size_t *outsize);
int
codecs_inplace_transform(codecs_t *codecs, void *input, int size,
    const char *operation);
int
codecs_size(codecs_t *codecs, const char *operation, const size_t insize);
int
codecs_valid_op(codecs_t *codecs, const char *op);
bool
codecs_input_needed(codecs_t *codecs, const char *operation);
int
codecs_begin(codecs_t *codecs, const char *subset, ...);
int
codecs_lockdown(codecs_t *codecs);
int
codecs_add(codecs_t *codecs, const char *operation,
    int (*)(const char *, const size_t, const char *, void *, size_t),
    const char *multiplier, const bool input_needed);
DESCRIPTION
libcodecs is a library interface which implements various transformations
from input data to output data. Text is transformed by the libcodecs
library, converting the input to the output format. New transformations
can be added to the table. The table can also be locked to prevent
further transformations being added. A lot of these transformations are
available at the system level already. However, libcodecs provides a
single, consistent interface to the transformations, in a way that is
easy to provide as an interface for scripting languages and from the
shell.
The basic way of using the libcodecs library is to call the
codecs_transform() function to transform the text. Two alternate
functions are provided. codecs_alloc_transform() will dynamically
allocate the space for the output array using calloc(3). In-place
transformations can be made using the codecs_inplace_transform()
function. An ``in-place'' transformation means that the transformation
will be done using temporary storage which is allocated, and then the
transformed text will be copied over the original input, thereby making
the operation appear to have transformed the text in situ.
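The ``in-place'' idea described above can be sketched as follows. This
is not the library's code: inplace_sketch() and upper_sketch() are
hypothetical, and the explicit bufsize argument is an assumption added
here so that the example cannot write past the caller's buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*transform_fn)(const char *, const size_t, const char *,
	void *, size_t);

/* stand-in transformation: upper-case copy, NUL-terminated */
int
upper_sketch(const char *in, const size_t insize, const char *op, void *vp,
	size_t outsize)
{
	char	*out = (char *)vp;
	size_t	 i;

	(void) op;
	if (outsize == 0) {
		return 0;
	}
	for (i = 0 ; i < insize && i < outsize - 1 ; i++) {
		out[i] = (char)toupper((unsigned char)in[i]);
	}
	out[i] = 0x0;
	return (int)i;
}

/* transform into temporary storage, then copy back over the input */
int
inplace_sketch(transform_fn fn, void *buf, size_t size, size_t bufsize,
	const char *op)
{
	void	*tmp;
	int	 cc;

	assert(fn != NULL && buf != NULL && size <= bufsize);
	if ((tmp = calloc(1, bufsize)) == NULL) {
		return -1;
	}
	cc = (*fn)(buf, size, op, tmp, bufsize);
	if (cc >= 0) {
		/* only overwrite the original once the transform worked */
		(void) memcpy(buf, tmp, bufsize);
	}
	free(tmp);
	return cc;
}
```

The input is left untouched if the transformation fails, which is one
reason to go through temporary storage rather than writing directly
over the input.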
The transformation table holding information on all the possible
transformations can be initialised using the codecs_begin() function. The
function can be used to limit the transformations which get loaded into
the transformation table. At the present time, the following subsets of
transformations are defined:
all will load all the following subsets of transformations
charset will load all the transformations relating to character sets,
including base64 and base85, EBCDIC, RAD50, etc.
digest will load all the transformations relating to message digests,
including md5, sha1, etc
fill will load all the transformations relating to region fill,
including zero and randomise
format will load all the transformations relating to formatting of out-
put, such as hexadecimal dumping, rotation, etc
edit will load all the transformations relating to editing of output,
such as sed and edit functionality
hash will load all the transformations relating to 32bit hashing.
network will load all the transformations relating to network name reso-
lution
It is not necessary to call this function prior to using any of the
functionality in the libcodecs library -- if the table has not been
initialised by the time of the first call, then it will be called
automatically.
The internal transformation information carries information on the
worst-case size of the output array. This size can be calculated using the
codecs_size() function, passing into the function the size of the input
buffer. The codecs_input_needed() function will return an indication
whether an input buffer is needed. Please note that an input buffer is
needed for the codecs_inplace_transform() transformation call. The
codecs_valid_op() function is used to verify that the current operation
is a known transformation.
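To make the "no input needed" case concrete, here is a minimal sketch
of a fill-style transformation which ignores its input entirely. The
name zero_sketch() is hypothetical, and the real zero transformation
in the library may be written differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical fill transformation: the input arguments are ignored,
 * and the whole output area is filled with NUL bytes.
 */
int
zero_sketch(const char *in, const size_t insize, const char *op, void *vp,
	size_t outsize)
{
	(void) in;
	(void) insize;
	(void) op;
	assert(vp != NULL);
	(void) memset(vp, 0x0, outsize);
	return (int)outsize;
}
```

Since nothing is read from "in", a caller could pass NULL and a size of
0 for the input, which is the situation codecs_input_needed() is there
to report.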
The idea behind the libcodecs library is that individual transformations
are defined by a C function with a pre-set calling signature. This can
be a wrapper around existing functionality, like the digest or strvis(3)
transformations, or user provided. This transformation is added to the
table of transformations using the codecs_add() function. Some
pre-defined transformations are provided, as explained below. The caller
can then invoke the transformation in one of three ways:
codecs_transform
by providing input data, and an area for the output of the
transformation to be placed.
codecs_alloc_transform
by providing input data, the area containing the output will be
dynamically allocated using calloc(3)
codecs_inplace_transform
in which the transformation will be made, and the output data
will be copied in place over the input data.
There are a number of pre-defined transformations provided:
asa [format] perform Fortran control character transformations
in the form of the POSIX asa(1) command.
base64decode
[charset] perform atob, or base64, decoding. Each sequence
of 4 bytes is transformed back into a 3 byte sequence.
base64encode
[charset] perform atob, or base64, encoding. Each sequence
of 3 bytes is transformed into a 4 byte sequence from the
pre-defined 64-byte set.
base85decode
[charset] perform base85 decoding. Each sequence of 5 bytes
is transformed back into a 4 byte sequence.
base85encode
[charset] perform base85 encoding. Each sequence of 4 bytes
is transformed into a 5 byte sequence from the pre-defined
85-byte set.
bin2hex [charset] encodes the input string as 4-character C-string
style hexadecimal constants.
bswap16 [format] perform a bytewise swap of the 16-bit entity
bswap32 [format] perform a bytewise swap of the 32-bit entity
bswap64 [format] perform a bytewise swap of the 64-bit entity
dos2unix [format] DOS style line-endings are transformed into Unix
style line-endings.
edit [edit] edit the input text with the ``EDITOR'' or ``VISUAL''
editor, as defined in the environment.
from-uri [charset] convert from a percent-encoded URI to ASCII text.
full-uuencode
[charset] convert the given text into uuencoded text (see
also the uuencode and uudecode transforms), adding a file
header and trailer.
gethostinfo [network] attempt to reverse resolve the hostname, given the
IP address (either IPv4 or IPv6) as input.
getipaddress
[network] attempt to resolve the IP address (both IPv4 and
IPv6) given the hostname as input.
gunzip [compress] decompress the input buffer using zlib(3)
gzip [compress] compress the input buffer using zlib(3)
hex2bin [charset] decodes the input string from 4-character C-string
style hexadecimal constants to binary output.
hexdump [format] converts the input text to an ASCII-clean hexadeci-
mal dump format, including a printable representation of the
input text.
list [fill] lists the available codecs in the current instance.
md5 [digest] calculate the MD5 digest using MD5Data(3)
metaphone [charset] calculate the metaphone phonetic value for the
input.
rad50decode [charset] converts the input text from DEC RADIX-50 format
to the original text. Due to the limited range of the
RADIX-50 character set, some of the original text may have
been lost.
rad50encode [charset] converts the input text to DEC RADIX-50 format
from the original text. Due to the limited range of the
RADIX-50 character set, some of the original text may have
been lost.
randomise [fill] fill the output with random values.
rmd160 [digest] calculate the RMD160 digest using RMD160Data(3)
rot [format] transform the input text with a circular rotation.
The most famous of these is the Caesar rot13(6) transforma-
tion, but this transformation allows any length of rotation
to be used.
secs2str [format] transforms the input value (as the ASCII-encoded
decimal value of seconds since the start of the epoch) to a
colon-separated value representing the date.
sed [edit] performs a sed(1) transformation on a regular expres-
sion. Please note that full, extended regular expressions,
as defined in re_format(7) are used to match.
size [digest] returns the size of the input as a decimal string
sha1 [digest] calculate the SHA1 digest using SHA1Data(3)
sha256 [digest] calculate the SHA256 digest using SHA256_Data(3)
sha512 [digest] calculate the SHA512 digest using SHA512_Data(3)
soundex [charset] calculate the soundex phonetic value for the
input.
str2secs [format] transforms the input value (as the colon-separated
value representing the date) to an ASCII-encoded decimal
value representing seconds since the start of the epoch.
strunvis [charset] uses the unstrvis(3) transformation on the input
data.
strvis [charset] uses the strvis(3) transformation on the input
data.
strvisc [charset] uses the strvisc(3) transformation on the input
data.
substring [edit] extract a substring of the input string, and place it
in the output string.
to-uri [charset] convert from ASCII text to a percent-encoded URI.
to-lower [charset] change any uppercase letters in the input string
to lowercase.
to-upper [charset] change any lowercase letters in the input string
to uppercase.
unhexdump [format] converts the input text from the ASCII-clean hexa-
decimal dump format, created by the hexdump transformation,
back to its original binary form.
unix2dos [charset] the Unix-style line-endings are converted to DOS
style line-endings.
uudecode [charset] transform the input text from uuencode(1) text to
the original text.
uuencode [charset] encode the input text as uuencode(1) text.
zero [fill] produce an area containing NUL bytes in the output.
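As an illustration of one entry in the list above, the circular
rotation behind the rot transformation can be sketched like this. The
function name rot_sketch() is hypothetical, as is the choice to rotate
only the ASCII letters and to take the rotation as a plain integer; how
the library's rot transformation receives its rotation length is not
shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: rotate the ASCII letters of a NUL-terminated
 * string by n places, wrapping around the alphabet. Other characters
 * are left alone. Negative rotations are allowed.
 */
void
rot_sketch(char *s, int n)
{
	size_t	len;
	size_t	i;

	assert(s != NULL);
	/* normalise n into the range 0..25 */
	n = ((n % 26) + 26) % 26;
	len = strlen(s);
	for (i = 0 ; i < len ; i++) {
		if (s[i] >= 'a' && s[i] <= 'z') {
			s[i] = (char)('a' + (s[i] - 'a' + n) % 26);
		} else if (s[i] >= 'A' && s[i] <= 'Z') {
			s[i] = (char)('A' + (s[i] - 'A' + n) % 26);
		}
	}
}
```

A rotation of 13 applied twice gives back the original text, which is
the familiar rot13 property.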
A number of hash functions have also been implemented, namely:
dumbhash [hash] implements a simple hashing scheme based on the
addition of the value of each character in the string.
dumbmulhash [hash] implements a simple hashing scheme based on the
addition of the value of each character in the string mul-
tiplied by its position in the string.
lennart [hash] implements a simple and fast generic string hasher
based on Peter K. Pearson's article in CACM 33-6, pp. 677.
crchash [hash] implements a hash used in CRC calculations
perlhash [hash] implements the addition-based hash algorithm used
internally in the perl interpreter.
perlxorhash [hash] implements the XOR-based hash algorithm used inter-
nally in the perl interpreter.
pythonhash [hash] implements the hash algorithm used internally in
the python interpreter.
mousehash [hash] implements an XOR-based hash algorithm from der
Mouse.
bernstein [hash] implements a multiplicative-based hash algorithm
from Daniel Bernstein.
honeyman [hash] implements an XOR-based hash algorithm from Peter
Honeyman.
pjwhash [hash] implements the so called `hashpjw' function by P.J.
Weinberger from Aho/Sethi/Ullman, COMPILERS: Principles,
Techniques and Tools, 1986, 1987 Bell Telephone Laborato-
ries, Inc.
bobhash [hash] implements another, more complex hash algorithm.
torekhash [hash] implements a hash algorithm due to Chris Torek, and
using Duff's device.
byacchash [hash] implements the hash function found in Berkeley
byacc(1) program
tclhash [hash] implements the hash algorithm used internally in
the tcl interpreter.
gawkhash [hash] implements the hash algorithm used internally in
the gawk interpreter, also using Duff's device.
gcc3_hash [hash] implements one of the hash algorithms found in gcc3
gcc3_hash2 [hash] implements another of the hash algorithms found in
gcc3
nemhash [hash] implements another hash function
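The two simplest hashes in the list above, dumbhash and dumbmulhash,
can be sketched directly from their descriptions. The function names
below are hypothetical, and counting positions from 1 in the second
function is an assumption; the library's own versions may differ in
both respects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* add up the value of each character in the string */
uint32_t
dumbhash_sketch(const char *s, size_t len)
{
	uint32_t	hash = 0;
	size_t		i;

	assert(s != NULL || len == 0);
	for (i = 0 ; i < len ; i++) {
		hash += (uint8_t)s[i];
	}
	return hash;
}

/* as above, but multiply each character by its (1-based) position */
uint32_t
dumbmulhash_sketch(const char *s, size_t len)
{
	uint32_t	hash = 0;
	size_t		i;

	assert(s != NULL || len == 0);
	for (i = 0 ; i < len ; i++) {
		hash += (uint32_t)(uint8_t)s[i] * (uint32_t)(i + 1);
	}
	return hash;
}
```

The additive version gives the same value for any reordering of the
same characters; weighting by position is what lets the second version
tell "abc" from "cba".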
RETURN VALUES
On a successful transformation, the codecs_transform(),
codecs_alloc_transform() and codecs_inplace_transform() functions return
the actual number of bytes in the output transformation. On a successful
initialisation, codecs_begin() will return a value of 1. The
codecs_size() function returns the number of bytes which will be needed
to contain the given transformation with the given size of input bytes.
SEE ALSO
asa(1), sed(1), uudecode(1), uuencode(1), calloc(3), MD5Data(3),
RMD160Data(3), SHA1Data(3), SHA256_Data(3), SHA512_Data(3), strvis(3),
strvisc(3), unstrvis(3), zlib(3), rot13(6), re_format(7)
HISTORY
The libcodecs library first appeared in NetBSD 6.0.
AUTHORS
Alistair Crooks <agc%NetBSD.org@localhost>
NetBSD 5.0 September 30, 2010 NetBSD 5.0