Preliminary notes

According to the unicode.org FAQ, the worst-case expansion factors, i.e. the size of the normalized string relative to the original string, are:

NFC: 3x (UTF-8, UTF-16 and UTF-32)
NFD: 3x (UTF-8), 4x (UTF-16 and UTF-32)
NFKC and NFKD: 11x (UTF-8), 18x (UTF-16 and UTF-32)
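
These factors can be checked with Python's standard unicodedata module. The sketch below is only an illustration: the helper names code_units and expansion_factor are hypothetical, and U+FDFA is chosen here as a sample character that reaches the NFKC/NFKD worst case.

```python
import unicodedata

def code_units(text: str, encoding: str) -> int:
    """Number of code units of `text` in the given UTF encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size

def expansion_factor(text: str, form: str, encoding: str) -> float:
    """Size of the normalized text divided by the size of the original."""
    normalized = unicodedata.normalize(form, text)
    return code_units(normalized, encoding) / code_units(text, encoding)

# U+FDFA expands to 18 code points under NFKC and NFKD.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, expansion_factor("\uFDFA", "NFKC", encoding))
```

A practical consequence is that a buffer sized from these factors can never overflow during normalization, at the cost of over-allocating for ordinary text.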

Properties:

ASCII text (non-extended, characters in [0-0x7F] only) is left unchanged by every normalization form
ISO-8859-1 text is already in NFC
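
Both properties can be verified exhaustively, since the ranges are tiny. This is a minimal sketch using Python's unicodedata module; the variable names are only illustrative.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every ASCII character, and every ISO-8859-1 character (U+0000..U+00FF).
ascii_text = "".join(chr(cp) for cp in range(0x80))
latin1_text = bytes(range(0x100)).decode("iso-8859-1")

# ASCII must survive all four forms untouched.
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)
# ISO-8859-1 is stable under NFC only: NFD decomposes the accented letters.
latin1_is_nfc = unicodedata.normalize("NFC", latin1_text) == latin1_text
latin1_is_nfd = unicodedata.normalize("NFD", latin1_text) == latin1_text

print(ascii_unchanged, latin1_is_nfc, latin1_is_nfd)
```

Note that the second property holds for NFC specifically: accented ISO-8859-1 letters are decomposed by NFD.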

Quick check

Enumeration FormeNormalisation {
    NFC,
    NFKC,
    NFD,
    NFKD
}

Enumeration ResultatNormalisation {
    YES,
    NO,
    MAYBE
}

function getCanonicalClass(integer codePoint): integer

function isAllowed(integer codePoint, FormeNormalisation forme): ResultatNormalisation


function quickcheck(string source, FormeNormalisation forme): ResultatNormalisation
    begin
        resultat <- YES
        derniereClasseCanonique <- 0

        for each code point, cp, in source do
            classeCanonique <- getCanonicalClass(cp)
            // a non-zero class lower than the previous one: marks are not in canonical order
            if derniereClasseCanonique > classeCanonique and classeCanonique != 0 then
                return NO
            endif
            check <- isAllowed(cp, forme)
            if check = NO
            then
                return NO
            else
                if check = MAYBE
                then
                    resultat <- MAYBE
                endif
            endif
            // remember the class for the comparison with the next code point
            derniereClasseCanonique <- classeCanonique
        endfor

        return resultat
    end
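
As an illustration, the quick check can be sketched in Python. Here unicodedata.combining stands in for getCanonicalClass, and the isAllowed lookup is passed in as a callback. The names QuickCheck, quick_check, NFC_QC_EXCERPT and nfc_allowed are hypothetical, and the two-entry table is a hand-written excerpt, not the full property data.

```python
import unicodedata
from enum import Enum
from typing import Callable

class QuickCheck(Enum):
    YES = 0
    MAYBE = 1
    NO = 2

def quick_check(source: str, is_allowed: Callable[[int], QuickCheck]) -> QuickCheck:
    """Quick check of `source` against the form that `is_allowed` describes."""
    result = QuickCheck.YES
    last_ccc = 0
    for ch in source:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # A non-zero class lower than the previous one: not in canonical order.
        if last_ccc > ccc and ccc != 0:
            return QuickCheck.NO
        check = is_allowed(ord(ch))
        if check is QuickCheck.NO:
            return QuickCheck.NO
        if check is QuickCheck.MAYBE:
            result = QuickCheck.MAYBE
        last_ccc = ccc
    return result

# Hand-written excerpt of NFC_QC, for illustration only.
NFC_QC_EXCERPT = {0x0301: QuickCheck.MAYBE, 0x0340: QuickCheck.NO}

def nfc_allowed(code_point: int) -> QuickCheck:
    # Code points absent from the table are allowed (YES).
    return NFC_QC_EXCERPT.get(code_point, QuickCheck.YES)

print(quick_check("e\u0301", nfc_allowed))
```

A MAYBE result means the quick check is inconclusive: the caller then has to normalize the string and compare it with the original to get a definite answer.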

To implement getCanonicalClass, the canonical combining class is obtained by parsing UnicodeData.txt. It is the numeric value of the 4th field (index 3 when counting from 0), the fields being obtained by splitting each line on the ';' character. The numeric values have the following meaning:

Value	Name	Description
0	Not_Reordered	Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing
1	Overlay	Marks which overlay a base letter or symbol
7	Nukta	Diacritic nukta marks in Brahmi-derived scripts
8	Kana_Voicing	Hiragana/Katakana voicing marks
9	Virama	Viramas
10		Start of fixed position classes
199		End of fixed position classes
200	Attached_Below_Left	Marks attached at the bottom left
202	Attached_Below	Marks attached directly below
204		Marks attached at the bottom right
208		Marks attached to the left
210		Marks attached to the right
212		Marks attached at the top left
214	Attached_Above	Marks attached directly above
216	Attached_Above_Right	Marks attached at the top right
218	Below_Left	Distinct marks at the bottom left
220	Below	Distinct marks directly below
222	Below_Right	Distinct marks at the bottom right
224	Left	Distinct marks to the left
226	Right	Distinct marks to the right
228	Above_Left	Distinct marks at the top left
230	Above	Distinct marks directly above
232	Above_Right	Distinct marks at the top right
233	Double_Below	Distinct marks subtending two bases
234	Double_Above	Distinct marks extending above two bases
240	Iota_Subscript	Greek iota subscript only
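
A possible way to build the lookup behind getCanonicalClass is sketched below in Python. The function names are hypothetical, the sample lines follow the UnicodeData.txt format, and only non-zero classes are stored, which assumes that every unlisted code point has class 0.

```python
from typing import Dict, Iterable

def load_canonical_classes(lines: Iterable[str]) -> Dict[int, int]:
    """Map code point -> canonical combining class, keeping non-zero classes only."""
    classes: Dict[int, int] = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 4:
            continue  # blank or malformed line
        ccc = int(fields[3])  # 4th field: canonical combining class
        if ccc != 0:
            classes[int(fields[0], 16)] = ccc
    return classes

def get_canonical_class(classes: Dict[int, int], code_point: int) -> int:
    # Code points that were not stored have class 0 (Not_Reordered).
    return classes.get(code_point, 0)

# Sample lines in the UnicodeData.txt format.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
    "0323;COMBINING DOT BELOW;Mn;220;NSM;;;;;N;NON-SPACING DOT BELOW;;;;",
    "094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;",
]
classes = load_canonical_classes(SAMPLE)
print(get_canonical_class(classes, 0x0301))
```

With the real file, the same function can be fed an open file object, for example open("UnicodeData.txt", encoding="utf-8").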

To determine whether a code point is allowed in the requested form (the implementation of isAllowed), parse the DerivedNormalizationProps.txt file and look for lines matching the pattern:

; NFK?[CD]_QC; [MN]

These are the quick check (QC) properties for each form (NFC, NFKC, NFD, NFKD). M assigns the value MAYBE to the code point; N assigns the value NO. Code points that are not listed have the value Y (YES).
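
The extraction of these lines could look like the following Python sketch. The regular expression extends the pattern above to also capture the code point or range at the start of the line. The names QC_LINE, load_quick_check and is_allowed_in are hypothetical, and the sample lines only imitate the file's layout.

```python
import re
from typing import Dict, Iterable

# "XXXX" or "XXXX..YYYY", then "; NFx_QC; M|N"
QC_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?"
    r"\s*;\s*(NFK?[CD])_QC\s*;\s*([MN])\b"
)

def load_quick_check(lines: Iterable[str]) -> Dict[str, Dict[int, str]]:
    """Per form, map each listed code point to 'M' or 'N'."""
    tables: Dict[str, Dict[int, str]] = {
        "NFC": {}, "NFD": {}, "NFKC": {}, "NFKD": {}
    }
    for line in lines:
        match = QC_LINE.match(line)
        if match is None:
            continue  # comment, blank line or another property
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        for code_point in range(first, last + 1):
            tables[match.group(3)][code_point] = match.group(4)
    return tables

def is_allowed_in(tables: Dict[str, Dict[int, str]], form: str, code_point: int) -> str:
    # Unlisted code points have the value Y (YES).
    return tables[form].get(code_point, "Y")

SAMPLE = [
    "# comment line",
    "037A          ; FC_NFKC; 0020 03B9 # another property, ignored",
    "00C0..00C5    ; NFD_QC; N # L&   [6] LATIN CAPITAL LETTER A WITH GRAVE..",
    "0340..0341    ; NFC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..",
    "0300..0304    ; NFC_QC; M # Mn   [5] COMBINING GRAVE ACCENT..",
    "00A0          ; NFKC_QC; N # Zs       NO-BREAK SPACE",
]
tables = load_quick_check(SAMPLE)
print(is_allowed_in(tables, "NFC", 0x0302))
```

Expanding each range into individual keys keeps lookups trivial; a sorted list of ranges with binary search would use less memory, at the price of a slightly more involved lookup.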

From there, a (first) implementation of the quickcheck function based on hash tables, or a similar structure, becomes possible. The files must be parsed beforehand in order to build these hash tables.

The D implementation of isAllowed maps a code point, when it is listed, to an unsigned 8-bit value: there are 4 forms (NFC, NFKC, NFD, NFKD) and 3 values to encode for each of them (YES, NO, MAYBE), so 2 bits per form are enough. The layout is:

11 22 33 44

bits 1: NFKC
bits 2: NFC
bits 3: NFKD
bits 4: NFD

And the values:

0 (00) ⇔ YES
1 (01) ⇔ MAYBE
2 (10) ⇔ NO

Code points that are absent from the filters associative array have the value YES (for all forms).
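
The packing can be sketched as follows, in Python rather than D for brevity. This is not the original D code: it assumes that the layout "11 22 33 44" is read from the most significant bit, so that NFKC occupies bits 7-6 and NFD bits 1-0. The helpers pack, unpack and is_allowed are hypothetical, and the two filters entries are illustrative.

```python
YES, MAYBE, NO = 0, 1, 2

# Bit offset of each form's 2-bit field (assumed: layout read from the MSB).
SHIFT = {"NFKC": 6, "NFC": 4, "NFKD": 2, "NFD": 0}

def pack(values: dict) -> int:
    """Pack per-form values into one byte; forms left out default to YES (00)."""
    packed = 0
    for form, value in values.items():
        packed |= (value & 0b11) << SHIFT[form]
    return packed

def unpack(packed: int, form: str) -> int:
    """Extract the 2-bit value stored for `form`."""
    return (packed >> SHIFT[form]) & 0b11

# Illustrative entries: U+00A0 is NO for NFKC and NFKD,
# U+0301 is MAYBE for NFC and NFKC.
filters = {
    0x00A0: pack({"NFKC": NO, "NFKD": NO}),
    0x0301: pack({"NFC": MAYBE, "NFKC": MAYBE}),
}

def is_allowed(code_point: int, form: str) -> int:
    # A missing entry behaves like the byte 0: YES for all forms.
    return unpack(filters.get(code_point, 0), form)

print(format(filters[0x00A0], "08b"))
```

Because YES is encoded as 00, a missing entry and an all-zero byte are equivalent, which is why only code points with at least one NO or MAYBE need to be stored.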

Concatenating normalized strings

X

Normalization itself

X

Consistent reading of a stream

X

julp
