A basic conversion consists of a from phase and a to phase. Codec names are
matched case-insensitively.
ISO-8859-1 : UTF-8
from to
Figure 1: Basic two phases conversion
Between the from and to phases, there can be an inter phase.
UTF-8 : UPPER : UTF-8
from inter to
Figure 2: Conversion with inter-mapping phase
There can be more than one inter phase.
UTF-8 : UPPER : FULL : UTF-8
from inter inter to
Figure 3: Conversion with multiple inter-mapping phases
An inter phase can also be used on its own, mostly programmatically.
HALF
inter
Figure 4: Standalone inter-mapping phase
Conversions can be cascaded with the pipe symbol. In most cases this is
equivalent to a shell pipe, except when codecs that manipulate flags are used
(described in section 2.2).
UTF-8 : BIG5 | BIG5 : UTF-8
from to from to
Figure 5: Cascaded conversions
ASCII-compatible codecs are designed to exclude the ASCII part and are named
_FOO, with the alias FOO ⇒ _FOO,ASCII or ASCII,_FOO.
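The conversion-string layout described in this section can be sketched in a few lines of C. This is only an illustration of the notation, not how libbsdconv parses its input: `count_char`, `cascade_count`, `phase_roles` and `MAX_PHASES` are hypothetical names introduced here. The sketch counts cascaded conversions by their pipe symbols and labels the phases of one conversion as from, inter, or to, following Figures 1 to 5.

```c
#include <assert.h>
#include <string.h>

#define MAX_PHASES 16   /* arbitrary limit chosen for this sketch */

/* Count how often character c occurs in s. */
int count_char(const char *s, char c)
{
    int n = 0;
    for (const char *p = strchr(s, c); p != NULL; p = strchr(p + 1, c))
        n++;
    return n;
}

/* Number of cascaded conversions: one more than the number of '|'. */
int cascade_count(const char *spec)
{
    assert(spec != NULL);
    return count_char(spec, '|') + 1;
}

/* Label the phases of a single conversion (no '|' inside).
 * Returns the number of phases, or -1 if there are too many. */
int phase_roles(const char *conversion, const char *roles[MAX_PHASES])
{
    assert(conversion != NULL && roles != NULL);
    int n = count_char(conversion, ':') + 1;
    if (n > MAX_PHASES)
        return -1;
    for (int i = 0; i < n; i++) {
        if (n == 1)
            roles[i] = "inter";   /* standalone inter phase (Figure 4) */
        else if (i == 0)
            roles[i] = "from";
        else if (i == n - 1)
            roles[i] = "to";
        else
            roles[i] = "inter";
    }
    return n;
}
```

For "UTF-8:UPPER:FULL:UTF-8" the roles come out as from, inter, inter, to, matching Figure 3.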
1.2 Codecs & Fallback
A phase consists of one or more codecs, separated by commas. A later codec is
used if and only if the earlier codecs fail to consume the incoming data. Once
a codec finishes its task, the first codec is tried again for the upcoming
data.
UTF-8 : ASCII , 3F
from to
Figure 6: Fallback codec
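The fallback rule can be modelled with a small chain of function pointers. This is a deliberately simplified sketch that works on already-decoded code points and is not taken from libbsdconv; the names `toy_codec`, `toy_ascii`, `toy_3f` and `toy_fallback_encode` are hypothetical. It mimics the to phase "ASCII,3F": each code point is offered to the first codec, and a later codec is tried only if the earlier ones refuse it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A toy codec: tries to encode one code point into *out.
 * Returns 1 if it consumed the code point, 0 if it failed. */
typedef int (*toy_codec)(uint32_t cp, unsigned char *out);

/* Accepts only the ASCII range. */
int toy_ascii(uint32_t cp, unsigned char *out)
{
    if (cp >= 0x80)
        return 0;
    *out = (unsigned char)cp;
    return 1;
}

/* Accepts anything and emits byte 0x3F ('?'). */
int toy_3f(uint32_t cp, unsigned char *out)
{
    (void)cp;
    *out = 0x3F;
    return 1;
}

/* Encode n code points through a fallback chain. out must have room for
 * n bytes. Returns the number of bytes written. */
size_t toy_fallback_encode(const uint32_t *cps, size_t n,
                           const toy_codec *chain, size_t nchain,
                           unsigned char *out)
{
    assert(cps != NULL && chain != NULL && out != NULL);
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        /* Every code point starts again at the first codec. */
        for (size_t c = 0; c < nchain; c++) {
            if (chain[c](cps[i], &out[len])) {
                len++;
                break;
            }
        }
        /* A code point that no codec consumes is skipped in this sketch. */
    }
    return len;
}
```

With the chain {toy_ascii, toy_3f}, the code points U+0041, U+2200, U+0042 encode to "A?B": the second codec is only reached for U+2200, and U+0042 goes back to the first codec.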
1.3 Codec argument
Some codecs take arguments, given after a hash symbol.
UTF-8 : ASCII , ANY#3F
Figure 7: Passing argument to codec
Some codecs take arguments in key-value form. Argument names and values
consist of digits, letters, hyphens, and underscores; binary data is
represented in hexadecimal form.
UTF-8 : ASCII , ESCAPE#PREFIX=2575
Figure 8: Passing argument to codec in key-value form
Multiple arguments can be passed by joining them with ampersands.
UTF-8 : ASCII , ESCAPE#PREFIX=262378&SUFFIX=3B
Figure 9: Passing multiple arguments to codec
A list of data items can be passed in dot-separated form.
ANY#013F.0121 : ASCII
Figure 10: Data list
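To make the argument notation concrete, the following sketch pulls one key-value argument out of a codec string and decodes its hexadecimal value. It illustrates the syntax only and is not libbsdconv code: `codec_arg` and `hex_decode` are hypothetical helpers, and matching argument names case-sensitively is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Decode a string of hex digits into bytes.
 * Returns the byte count, or -1 on malformed input or a too-small buffer. */
int hex_decode(const char *hex, unsigned char *out, size_t cap)
{
    assert(hex != NULL && out != NULL);
    size_t len = strlen(hex);
    if (len % 2 != 0 || len / 2 > cap)
        return -1;
    for (size_t i = 0; i < len; i += 2) {
        int v = 0;
        for (size_t j = 0; j < 2; j++) {
            unsigned char c = (unsigned char)hex[i + j];
            if (!isxdigit(c))
                return -1;
            v = v * 16 + (isdigit(c) ? c - '0' : toupper(c) - 'A' + 10);
        }
        out[i / 2] = (unsigned char)v;
    }
    return (int)(len / 2);
}

/* Copy the value of `key` from "CODEC#K1=V1&K2=V2" into val.
 * Returns 1 if the key was found and fits, 0 otherwise. */
int codec_arg(const char *codec, const char *key, char *val, size_t cap)
{
    assert(codec != NULL && key != NULL && val != NULL && cap > 0);
    const char *p = strchr(codec, '#');
    size_t klen = strlen(key);
    while (p != NULL) {
        p++;                              /* skip '#' or '&' */
        size_t seg = strcspn(p, "&");     /* length of this argument */
        if (seg > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = seg - klen - 1;
            if (vlen >= cap)
                return 0;
            memcpy(val, p + klen + 1, vlen);
            val[vlen] = '\0';
            return 1;
        }
        p = (p[seg] == '&') ? p + seg : NULL;
    }
    return 0;
}
```

Applied to "ESCAPE#PREFIX=262378&SUFFIX=3B" from Figure 9, PREFIX decodes to the three bytes "&#x" and SUFFIX to ";". The prefix 2575 in Figure 8 decodes to "%u".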
2 Type & Flag
2.1 Type
A code point packet records its type in its first byte.
ID  Description                 Provider (from)   Consumer (to)
00  Bsdconv special characters  BSDCONV-KEYWORD   BSDCONV-KEYWORD
01  Unicode                     Most decoders     Most encoders
02  CNS11643 [1]                CNS11643          CNS11643
03  Byte                        BYTE; ESCAPE      BYTE; ESCAPE#FOR=BYTE
04  Chinese components          inter/ZH-DECOMP   inter/ZH-COMP
1B  ANSI control sequence       ANSI-CONTROL      -
Table 1: Types and their providers/consumers (just to name a few)
Entity  Unicode  UTF-8 Hex
%       U+0025   25
A       U+0041   41
∀       U+2200   E28880

Input (UTF-8 literal):        A∀
Decoder (ASCII,BYTE : ...)
Internal data:                [01 41] [03 E2] [03 88] [03 80]
Encoder (... : ASCII,ESCAPE)
Internal data:                41 ("A")  25 45 32 ("%E2")  25 38 38 ("%88")  25 38 30 ("%80")
Output (UTF-8 literal):       A%E2%88%80
Figure 11: Fallback & Type
[1] For characters in the intersection of CNS11643 and Unicode, from/CNS11643
converts to the Unicode type where possible. Conversely, to/CNS11643 converts
from the Unicode type where possible.
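The flow of Figure 11 can be reproduced with a toy model. This is a sketch under simplifying assumptions, not libbsdconv internals: `struct toy_packet`, `fig11_decode` and `fig11_encode` are hypothetical, and each packet carries only a single payload byte, which is enough for this example. The decoder plays the role of "ASCII,BYTE" (ASCII bytes become type 01, everything else falls back to type 03), and the encoder plays the role of "ASCII,ESCAPE" (type 01 is emitted as is, everything else is percent-escaped).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum { TYPE_UNICODE = 0x01, TYPE_BYTE = 0x03 };   /* IDs from Table 1 */

struct toy_packet {
    unsigned char type;    /* first byte of the packet */
    unsigned char value;   /* single-byte payload (enough here) */
};

/* from phase "ASCII,BYTE": ASCII succeeds below 0x80, BYTE is the fallback. */
size_t fig11_decode(const unsigned char *in, size_t n, struct toy_packet *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].type = (in[i] < 0x80) ? TYPE_UNICODE : TYPE_BYTE;
        out[i].value = in[i];
    }
    return n;
}

/* to phase "ASCII,ESCAPE": ASCII handles Unicode packets below 0x80,
 * ESCAPE is the fallback and writes %XX.
 * out must hold 3 * n + 1 chars. Returns the output length. */
size_t fig11_encode(const struct toy_packet *in, size_t n, char *out)
{
    assert(in != NULL && out != NULL);
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].type == TYPE_UNICODE && in[i].value < 0x80)
            out[len++] = (char)in[i].value;
        else
            len += (size_t)sprintf(out + len, "%%%02X",
                                   (unsigned)in[i].value);
    }
    out[len] = '\0';
    return len;
}
```

Feeding the four input bytes 41 E2 88 80 through both functions yields the packets and the output string shown in Figure 11.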
2.2 Flag
A code point packet carries its own flags. Currently there are two flags,
FREE and MARK. Flag FREE indicates that the packet buffer needs to be recycled
or released; it matters only when programming against the library. Flag MARK
is currently added only by codec to/PASS#MARK and is used by codec
from/PASS#UNMARK to identify packets that have already been decoded and need
to be passed through in the from phase.
The code point packet structure, including flags, is retained within cascaded
conversions, but not across a shell pipe. Figure 12 demonstrates the flow of
the conversion "ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8".
Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2

Input (UTF-8 literal):        %u03B1%CE%B2
Decoder (ESCAPE : ...)
Internal data:                [01 03 B1] [03 CE] [03 B2]
Encoder (... : PASS#MARK&FOR=1,BYTE)
Internal data:                [01 03 B1] (MARK)  CE B2
Decoder (PASS#UNMARK,UTF-8 : ...)
Internal data:                [01 03 B1] [01 03 B2]
Encoder (... : UTF-8)
Internal data:                CE B1 ("α")  CE B2 ("β")
Output (UTF-8 literal):       αβ
Figure 12: Flag, from/PASS & to/PASS
2.3 Helper codecs
Codec from/bsdconv can be used to feed in the internal data structure, and
codec to/BSDCONV-OUTPUT can be used to inspect types and flags.
3 C Programming guide
3.1 Conversion instance lifecycle
bsdconv_create()
  -> bsdconv_init()
  -> set input/output parameters
  -> is last chunk? (yes: set flush flag)
  -> bsdconv()
  -> collect output
  -> has next chunk? (yes: next chunk, back to "set input/output parameters")
  -> reuse instance? (yes: back to bsdconv_init(); no: bsdconv_destroy())
Figure 13: Conversion instance lifecycle
3.2 Skeleton
#include <stdio.h>      /* BUFSIZ */
#include <bsdconv.h>

struct bsdconv_instance *ins;
char *buf;
size_t len;

ins = bsdconv_create("UTF-8:UPSIDEDOWN:UTF-8");
bsdconv_init(ins);
do {
    buf = bsdconv_malloc(BUFSIZ);
    /*
     * fill data into buf
     * len = filled data length
     */
    ins->input.data = buf;
    ins->input.len = len;
    ins->input.flags |= F_FREE;   /* buffer is to be recycled or released */
    ins->input.next = NULL;
    if (ins->input.len == 0) {    /* last chunk */
        ins->flush = 1;
    }
    /*
     * set output parameter (see section 3.3)
     */
    bsdconv(ins);
    /*
     * collect output (see section 3.3)
     */
} while (ins->flush == 0);
bsdconv_destroy(ins);
For chunked conversion, a separate input buffer should be allocated for each
input so that its content does not change during conversion. An output buffer
with flag FREE is safe to reuse.
3.3 Output mode
ins->output_mode      Description
BSDCONV_HOLD          Hold output in memory
BSDCONV_AUTOMALLOC    Return an output buffer, which should be free()d after use
BSDCONV_PREMALLOCED   Fill output into the given buffer
BSDCONV_FILE          Write output into a (FILE *) stream
BSDCONV_FD            Write output into an (int) file descriptor
BSDCONV_NULL          Discard output
BSDCONV_PASS          Pass output to another conversion instance
3.3.1 BSDCONV_HOLD
This is the default output mode after bsdconv_init(). It is usually used
together with BSDCONV_AUTOMALLOC or BSDCONV_PREMALLOCED to get the output
squeezed into one buffer.
3.3.2 BSDCONV_AUTOMALLOC
The output buffer is allocated dynamically. The actual buffer size is
ins->output.len + the output content length, which is useful when you need
room for a terminating null byte.
3.3.3 BSDCONV_PREMALLOCED
If ins->output.data is NULL, the total length of the content to be output is
stored in ins->output.len, but the output is still held in memory. Otherwise,
bsdconv() fills in as much unfragmented data as possible within the buffer
size limit specified in ins->output.len.
3.3.4 BSDCONV_FILE
Output is written with fwrite() to the FILE * given in ins->output.data.
3.3.5 BSDCONV_FD
Output is written with write() to the (int) file descriptor given in
ins->output.data. Casting through intptr_t (defined in <stdint.h>) is needed
to avoid a compiler warning.
3.3.6 BSDCONV_NULL
Output is discarded. This is usually used when only evaluating a conversion
(see section 3.4).
3.3.7 BSDCONV_PASS
Output packets are passed to the (struct bsdconv_instance *) conversion
instance given in ins->output.data.
3.4 Counters
Counters are kept in ins->counter as a linked list with the following
structure.
struct bsdconv_counter_entry {
    char *key;
    bsdconv_counter_t val;
    struct bsdconv_counter_entry *next;
};
IERR and OERR are mandatory error counters.
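Walking this list by hand looks roughly as follows. The sketch is self-contained, so it repeats the structure above and assumes bsdconv_counter_t is size_t; in a real program both come from <bsdconv.h>. `find_counter` is a hypothetical helper, not a libbsdconv function, and comparing keys with an exact, case-sensitive match is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t bsdconv_counter_t;   /* assumed here; see <bsdconv.h> */

struct bsdconv_counter_entry {
    char *key;
    bsdconv_counter_t val;
    struct bsdconv_counter_entry *next;
};

/* Return a pointer to the value of the counter called name,
 * or NULL if the list has no such entry. */
bsdconv_counter_t *find_counter(struct bsdconv_counter_entry *head,
                                const char *name)
{
    assert(name != NULL);
    for (struct bsdconv_counter_entry *e = head; e != NULL; e = e->next) {
        if (strcmp(e->key, name) == 0)
            return &e->val;
    }
    return NULL;
}
```

After a conversion, a caller could pass ins->counter as head and look up "IERR" or "OERR" to see whether any input or output errors were counted.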
There are two APIs to get or reset counters:
bsdconv_counter_t *bsdconv_counter(char *name);
Returns a pointer to the counter value. bsdconv_counter_t is currently defined
as size_t.
void bsdconv_counter_reset(char *name);
Resets the specified counter; if name is NULL, all counters are reset.
3.5 Memory pool issue
In case libbsdconv and your program use different memory pools,
bsdconv_malloc() and bsdconv_free() should be used in place of malloc() and
free().