TDDA Utility Functions

tdda.utils

String Normalization

tdda.utils.nftk(s)

Normalize s to TDDA Normal Form TKC (NFKC base, accents stripped).

Equivalent to nftkc(s). Applies Unicode Compatibility Normalization KC, maps various quotes, dashes, and similar characters to ASCII equivalents, and strips combining diacritical marks.

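TKC and TKD differ in whether canonical composition (TKC) or canonical decomposition (TKD) is applied after compatibility normalization. Since accents are stripped in both forms, the results differ mainly for characters whose decomposition is not simply a base letter plus accents, such as Hangul syllable blocks (e.g. 가, which NFKD splits into the conjoining jamo U+1100 and U+1161).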

Parameters:

s (str) – String to normalize.

Returns:

Normalized string in TKC form.

tdda.utils.nftkd(s)

Normalize s to TDDA Normal Form TKD (NFKD base, accents stripped).

Applies Unicode Compatibility Normalization KD, maps various quotes, dashes, and similar characters to ASCII equivalents, and strips combining diacritical marks.

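TKC and TKD differ in whether canonical composition (TKC) or canonical decomposition (TKD) is applied after compatibility normalization. Since accents are stripped in both forms, the results differ mainly for characters whose decomposition is not simply a base letter plus accents, such as Hangul syllable blocks (e.g. 가, which NFKD splits into the conjoining jamo U+1100 and U+1161).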

Parameters:

s (str) – String to normalize.

Returns:

Normalized string in TKD form.
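
The composition/decomposition distinction behind TKC and TKD can be seen with the standard library alone. The sketch below uses unicodedata rather than tdda itself, so it shows only the underlying Unicode behaviour, not the extra TDDA mappings; the variable names are illustrative.

```python
import unicodedata

syllable = "\uAC00"  # HANGUL SYLLABLE GA

# NFKC keeps the precomposed syllable; NFKD splits it into conjoining jamo.
composed = unicodedata.normalize("NFKC", syllable)
decomposed = unicodedata.normalize("NFKD", syllable)

print(len(composed), len(decomposed))           # 1 2
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+1100', 'U+1161']

# An accented Latin letter decomposes to a base letter plus a combining mark.
# Once that mark is stripped, both forms leave the same plain letter.
e_acute = unicodedata.normalize("NFKD", "\u00E9")
stripped = "".join(c for c in e_acute if unicodedata.combining(c) == 0)
print(stripped)                                 # e
```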

tdda.utils.normal_form_tk(s, remove_accents=True, strip=False, standardize_space=False, nfkd=False)

Map s to TDDA Normal Form TK (NFTK), with a choice of NFKC or NFKD as the base form and optional whitespace normalization.

NFTK applies Unicode Compatibility Normalization (NFKC or NFKD) plus additional mappings for commonly confused characters, with optional accent removal and space normalization.

Extra mappings beyond NFKC/NFKD include:
  • Dashes and minus signs → ASCII hyphen-minus

  • Curly quotes and apostrophes → ASCII ' and "

  • Tab and a few other whitespace forms → space

  • Some combined characters like œ → oe

Parameters:
  • s (str) – String to normalize.

  • remove_accents (bool) – Strip combining diacritical marks (default True).

  • strip (bool) – Strip leading and trailing whitespace (default False).

  • standardize_space (bool) – Collapse runs of spaces to a single space (default False).

  • nfkd (bool) – Use NFKD base form instead of the default NFKC (default False).

Returns:

Normalized string.

Note

Unless the whitespace handling is required, the short form nftk() should normally be used (or nftkd() if decomposed form is required in edge cases).
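
To make the options concrete, here is a rough standard-library approximation of the behaviour described above. It is a sketch, not the tdda implementation: the function name sketch_normal_form_tk and the EXTRA_MAP table are hypothetical, the mapping table is only a small illustrative subset, and the order of the steps is an assumption.

```python
import re
import unicodedata

# Illustrative subset of the extra mappings (assumed, not tdda's actual table).
EXTRA_MAP = str.maketrans({
    "\u2010": "-",   # hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2212": "-",   # minus sign
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\t": " ",       # tab
    "\u0153": "oe",  # oe ligature
})


def sketch_normal_form_tk(s, remove_accents=True, strip=False,
                          standardize_space=False, nfkd=False):
    """Approximate the documented options using only the standard library."""
    # Decompose fully so that accents become separate combining characters.
    s = unicodedata.normalize("NFKD", s)
    if remove_accents:
        # Drop characters with a non-zero canonical combining class,
        # which covers the common diacritics.
        s = "".join(ch for ch in s if unicodedata.combining(ch) == 0)
    s = s.translate(EXTRA_MAP)
    if not nfkd:
        # Canonical composition of fully decomposed text gives the KC form.
        s = unicodedata.normalize("NFC", s)
    if standardize_space:
        s = re.sub(r" {2,}", " ", s)
    if strip:
        s = s.strip()
    return s


print(sketch_normal_form_tk("\u201Ccaf\u00E9\u201D \u2013 \u0153uvre"))
# "cafe" - oeuvre
```

Decomposing first and recomposing at the end lets a single code path serve both base forms; only the final composition step depends on the nfkd flag.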

Unicode Glyph Counting

tdda.utils.n_glyphs(s)

Return the number of user-perceived glyphs (grapheme clusters) in s.
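
The reason a glyph count differs from len() is that one user-perceived character may span several code points. The sketch below is a deliberately crude approximation that counts only code points that are not combining marks; the helper name approx_glyph_count is hypothetical. It ignores the full grapheme cluster rules (emoji ZWJ sequences, regional indicator pairs, conjoining Hangul jamo), so it should not be read as how n_glyphs is implemented.

```python
import unicodedata


def approx_glyph_count(s):
    """Count code points whose general category is not a mark (Mn, Mc, Me)."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(len(decomposed))                 # 5 code points
print(approx_glyph_count(decomposed))  # 4 user-perceived characters
```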

RFC 9839 Character Handling

tdda.utils.handle_rfc9839_forbiddens(text, delete=True)

Remove RFC 9839 forbidden characters from text, or replace them with U+FFFD.

Forbidden characters are:
  • Surrogates: U+D800–U+DFFF

  • C0 controls: U+0000–U+001F, except tab (U+0009), LF (U+000A), and CR (U+000D)

  • DEL and C1 controls: U+007F–U+009F

  • Noncharacters: U+FDD0–U+FDEF (32 chars) and U+xFFFE/U+xFFFF for all 17 Unicode planes (34 chars)

Parameters:
  • text (str) – Text to clean.

  • delete (bool) – If True (default), remove forbidden characters. If False, replace them with U+FFFD (REPLACEMENT CHARACTER).

Returns:

Cleaned text with forbidden characters removed or replaced.
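
The forbidden ranges listed above can be expressed as a single regular expression character class. The following is a minimal sketch of that idea, assuming the ranges exactly as documented here; the names FORBIDDEN and scrub_forbidden are hypothetical and this is not the tdda source.

```python
import re

# U+nFFFE and U+nFFFF for each of the 17 planes, written as \UXXXXXXXX escapes.
_PLANE_ENDS = "".join(
    f"\\U{(plane << 16) | 0xFFFE:08X}\\U{(plane << 16) | 0xFFFF:08X}"
    for plane in range(17)
)

FORBIDDEN = re.compile(
    "["
    r"\x00-\x08\x0B\x0C\x0E-\x1F"  # C0 controls except tab, LF, CR
    r"\x7F-\x9F"                   # DEL and C1 controls
    r"\uD800-\uDFFF"               # surrogates
    r"\uFDD0-\uFDEF"               # 32 noncharacters
    + _PLANE_ENDS +                # 34 plane-final noncharacters
    "]"
)


def scrub_forbidden(text, delete=True):
    """Delete forbidden characters, or replace each one with U+FFFD."""
    return FORBIDDEN.sub("" if delete else "\uFFFD", text)


print(repr(scrub_forbidden("a\x00b\tc\x7f")))                # 'ab\tc'
print(repr(scrub_forbidden("a\x00b\tc\x7f", delete=False)))  # 'a\ufffdb\tc\ufffd'
```

Tab, LF, and CR pass through unchanged because they fall in the gaps between the C0 ranges.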

tdda.utils.check_unicode_assignables(text, field_name)

Return warnings for RFC 9839 forbidden characters found in text.

Checks for, but does not reject, characters outside the Unicode Assignables subset: surrogates, C0 controls (except tab, LF, and CR), DEL and C1 controls, and noncharacters.

Parameters:
  • text (str) – Text to check.

  • field_name (str) – Label used in warning messages.

Returns:

List of warning strings, one per category of problematic character found, or an empty list if the text is clean.
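
A per-category check of this kind might look like the sketch below. The category labels, the wording of the warnings, and the name sketch_check_assignables are assumptions made for illustration; the actual messages produced by tdda may differ.

```python
import re

# Plane-final noncharacters U+nFFFE and U+nFFFF for planes 0 to 16.
_PLANE_ENDS = "".join(
    f"\\U{(plane << 16) | 0xFFFE:08X}\\U{(plane << 16) | 0xFFFF:08X}"
    for plane in range(17)
)

# One pattern per category, in the order the warnings are reported.
CATEGORIES = {
    "surrogates": re.compile(r"[\uD800-\uDFFF]"),
    "C0 controls": re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]"),
    "DEL or C1 controls": re.compile(r"[\x7F-\x9F]"),
    "noncharacters": re.compile("[\\uFDD0-\\uFDEF" + _PLANE_ENDS + "]"),
}


def sketch_check_assignables(text, field_name):
    """Return one warning per category present; never raises or modifies text."""
    return [
        f"{field_name}: contains {label}"
        for label, pattern in CATEGORIES.items()
        if pattern.search(text)
    ]


print(sketch_check_assignables("plain text", "title"))  # []
print(sketch_check_assignables("a\x00b\ufffe", "title"))
# ['title: contains C0 controls', 'title: contains noncharacters']
```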