Overview

The tdda package provides Python support for test-driven data analysis (see the 1-page summary with references, the blog, or the book).

  • The tdda.referencetest library supports the creation of reference tests, based on either unittest or pytest.

    • Semantic Equivalence. Reference tests allow checking outputs for semantic equivalence rather than identical objects. Tests can be parameterized to specify what is allowed to vary.

    • File-based Comparisons. Strings and data frames can be compared against reference outputs stored in files. When tests fail, the in-memory data structures are written to file and suitable diff commands are suggested.

    • Automatic Rewriting of Reference Results. When behaviour is updated, reference results in files can be selectively rewritten after verification.

    • Tagging of Tests. Tests and test classes can be tagged, and run selectively without naming them on the command line. The list of failing tests from tdda.referencetest runs can also be logged automatically, and used to tag failing tests for selective re-running.
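
The idea of a file-based comparison that tolerates permitted variation can be sketched with the standard library alone. This is an illustration of the concept, not the tdda.referencetest implementation or its API: the function names `normalize` and `check_against_reference` and the `ignore_patterns` argument are assumptions made for this sketch.

```python
import re
import tempfile
from pathlib import Path


def normalize(text, ignore_patterns=()):
    """Mask every match of each ignore pattern so it cannot cause a difference."""
    for pattern in ignore_patterns:
        text = re.sub(pattern, "<IGNORED>", text)
    return text


def check_against_reference(actual, ref_path, ignore_patterns=()):
    """Return True if `actual` matches the reference file once ignored parts are masked.

    On failure, write the actual output to a temporary file and print a diff
    command that would show the discrepancy.
    """
    expected = Path(ref_path).read_text()
    if normalize(actual, ignore_patterns) == normalize(expected, ignore_patterns):
        return True
    with tempfile.NamedTemporaryFile("w", suffix=".actual", delete=False) as f:
        f.write(actual)
    print(f"Output differs; compare with:\n  diff {f.name} {ref_path}")
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ref = Path(d) / "report.txt"
        ref.write_text("Report generated 2024-01-01\nTotal: 42\n")
        date = r"\d{4}-\d{2}-\d{2}"
        # Same content, different date: accepted because dates are allowed to vary.
        print(check_against_reference("Report generated 2025-06-30\nTotal: 42\n", ref, [date]))
        # Different total: rejected, and a diff command is suggested.
        print(check_against_reference("Report generated 2025-06-30\nTotal: 43\n", ref, [date]))
```

Masking the variable parts on both sides before comparing keeps the check strict everywhere else, which is the sense in which the outputs are semantically equivalent rather than identical.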

  • The tdda gentest utility uses tdda.referencetest to generate tests for any command-line command (scripts and programs in any language). These tests can check several kinds of output, including output to stdout, to stderr, and to files, and can cope with some variation in outputs (semantic correctness).

  • The tdda.constraints library and command-line tools are used to:

    • discover constraints from a (Pandas) DataFrame, and write them out as JSON (tdda discover command).

    • verify that datasets meet the constraints in the constraints file (tdda verify command).

    • detect individual records/values that fail to meet the constraints (tdda detect command).

    As well as data frames in Parquet files and flat (CSV) files, the library supports tables in a variety of relational databases without extraction. Discovery, verification, and detection of failing records are all available from the command line as well as from the Python API.
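
The discover, verify, and detect steps can be illustrated with a small standard-library sketch. It is not the tdda.constraints implementation, and the constraint names used here (`type`, `min`, `max`, `allow_nulls`) are assumptions chosen for the example rather than the keys of the tdda JSON format. Rows are plain dictionaries instead of a DataFrame.

```python
import json


def measure(value):
    """Strings are constrained by length; other values by the value itself."""
    return len(value) if isinstance(value, str) else value


def discover(rows):
    """Infer simple per-field constraints from example rows.

    Assumes every field has at least one non-null value.
    """
    constraints = {}
    for field in rows[0]:
        values = [r[field] for r in rows if r[field] is not None]
        constraints[field] = {
            "type": type(values[0]).__name__,
            "min": min(measure(v) for v in values),
            "max": max(measure(v) for v in values),
            "allow_nulls": len(values) < len(rows),
        }
    return constraints


def failures(row, constraints):
    """Return the names of the fields in `row` that violate their constraints."""
    failed = []
    for field, c in constraints.items():
        value = row[field]
        if value is None:
            ok = c["allow_nulls"]
        else:
            # The type check comes first, so mismatched types are never compared.
            ok = type(value).__name__ == c["type"] and c["min"] <= measure(value) <= c["max"]
        if not ok:
            failed.append(field)
    return failed


def verify(rows, constraints):
    """True if every row satisfies every constraint."""
    return all(not failures(row, constraints) for row in rows)


def detect(rows, constraints):
    """Return (row index, failing fields) for each row that breaks a constraint."""
    found = []
    for i, row in enumerate(rows):
        failed = failures(row, constraints)
        if failed:
            found.append((i, failed))
    return found


if __name__ == "__main__":
    known_good = [
        {"age": 31, "name": "Ann"},
        {"age": 45, "name": "Bob"},
        {"age": 38, "name": None},
    ]
    constraints = discover(known_good)
    print(json.dumps(constraints, indent=2))  # constraints serialized as JSON

    incoming = [
        {"age": 40, "name": "Cat"},
        {"age": 102, "name": "Dave"},
        {"age": None, "name": None},
    ]
    print(verify(incoming, constraints))
    print(detect(incoming, constraints))
```

Verification answers a yes/no question about the whole dataset, while detection reports which records fail and on which fields, which is why the two are separate operations.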

  • The tdda.rexpy library is a tool for automatically inferring regular expressions from a column in a Pandas DataFrame or from a (Python) list of examples. There is also a command-line utility for Rexpy (rexpy command).
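
To give a feel for what inferring a regular expression from examples involves, here is a deliberately simple sketch based on run-length encoding of character classes. It is not the Rexpy algorithm, which is more sophisticated, and the function names `char_class`, `shape`, and `infer` are hypothetical.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map a character to a regex fragment describing its class."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Z]" if ch.isupper() else "[a-z]"
    return re.escape(ch)  # anything else is kept as a literal


def shape(s):
    """Run-length encode a string as a list of (class, run length) pairs."""
    return [(cls, len(list(run))) for cls, run in groupby(s, key=char_class)]


def infer(examples):
    """Return one anchored regex per distinct sequence of character classes."""
    groups = {}
    for s in examples:
        pairs = shape(s)
        classes = tuple(cls for cls, _ in pairs)
        groups.setdefault(classes, []).append([n for _, n in pairs])

    patterns = []
    for classes, lengths in groups.items():
        parts = []
        # zip(*lengths) gives, for each position, the run lengths seen there.
        for cls, counts in zip(classes, zip(*lengths)):
            lo, hi = min(counts), max(counts)
            if lo == hi == 1:
                quantifier = ""
            elif lo == hi:
                quantifier = f"{{{lo}}}"
            else:
                quantifier = f"{{{lo},{hi}}}"
            parts.append(cls + quantifier)
        patterns.append("^" + "".join(parts) + "$")
    return patterns


if __name__ == "__main__":
    examples = ["EH1 3JF", "G12 8QQ", "123-4567"]
    patterns = infer(examples)
    for p in patterns:
        print(p)
    # Every example is matched by at least one inferred pattern.
    print(all(any(re.match(p, s) for p in patterns) for s in examples))
```

Grouping by class sequence means that strings with different structures yield separate patterns instead of one over-general expression.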

  • The tdda diff tool can compare data frames in Parquet files and/or flat files and report differences in a visual format, somewhat like a visual diff for text files. Command-line options and other specifiers control which differences are reported, and how.
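
A cell-level comparison of two tables can be sketched as follows. This is only an illustration of the kind of difference such a tool reports, not how tdda diff works: matching rows on a key column, the function name `diff_tables`, and the shape of the result are all assumptions of this sketch.

```python
def diff_tables(left, right, key):
    """Compare two lists of dict rows, matching rows on the `key` field.

    Returns a list of (key value, field, left value, right value) tuples,
    one per differing cell. A row present on only one side is reported
    with the pseudo-field "<row>".
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    diffs = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l, r = left_by_key.get(k), right_by_key.get(k)
        if l is None or r is None:
            diffs.append((
                k,
                "<row>",
                "missing" if l is None else "present",
                "missing" if r is None else "present",
            ))
            continue
        for field in sorted(l.keys() | r.keys()):
            if l.get(field) != r.get(field):
                diffs.append((k, field, l.get(field), r.get(field)))
    return diffs


if __name__ == "__main__":
    old = [{"id": 1, "price": 10.0, "qty": 3}, {"id": 2, "price": 4.5, "qty": 1}]
    new = [{"id": 1, "price": 10.5, "qty": 3}, {"id": 3, "price": 7.0, "qty": 2}]
    for d in diff_tables(old, new, key="id"):
        print(d)
```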

  • The tdda.serial format allows the CSV or other flat-file format in use to be documented in a companion .serial file, for more accurate and portable reading and writing of flat files. The tdda serial command provides functionality for reading and writing tdda.serial metadata, as well as for reading and writing CSVW and Frictionless metadata, and for converting between these metadata formats.

  • The tdda utility functions provide tools for normalization and measurement of Unicode text. These include:

    • Normal Form TK. Normalization functions for Normal Form TK, an extension of Unicode Compatibility Normalization (normal forms KC and KD).

    • Glyph counting (as opposed to character counting) in strings.

    • RFC 9839 support. Implementation of RFC 9839's recommendations for removal or replacement of deprecated Unicode characters in input data.
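
The difference between character counting and glyph counting, and the effect of compatibility normalization, can be seen with Python's standard unicodedata module. This sketch does not use the tdda utility functions and does not implement Normal Form TK; it shows plain NFKC and a rough glyph count that treats each non-combining code point as one glyph, an approximation that ignores cases such as emoji joined by zero-width joiners. The function names are hypothetical.

```python
import unicodedata


def normalize_nfkc(text):
    """Apply standard Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def approx_glyph_count(text):
    """Approximate the number of glyphs by skipping combining marks.

    Composes the text (NFC) first, then counts code points whose canonical
    combining class is zero. This is an approximation, not full grapheme
    cluster segmentation.
    """
    composed = unicodedata.normalize("NFC", text)
    return sum(1 for ch in composed if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    decomposed = "cafe\u0301"          # "café" written as e + combining acute accent
    print(len(decomposed))             # number of code points
    print(approx_glyph_count(decomposed))

    ligature = "\ufb01le"              # "file" written with the U+FB01 "fi" ligature
    print(normalize_nfkc(ligature))
```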

Although the library is provided as a Python package and can be called through its Python API, it also provides command-line tools, which are installed when the package is installed.