# Overview

The `tdda` package provides Python support for test-driven data analysis
(see [1-page summary](http://stochasticsolutions.com/pdf/TDDA-One-Pager.pdf)
with references, or the
[blog](http://www.tdda.info/pages/table-of-contents.html#table-of-contents),
or the [book](https://book.tdda.info)).

* :::{image} image/reftest.png
  :width: 150px
  :align: right
  :alt: A white spiral with white text, assertStringCorrect above it, on an orange background.
  :::

  The {py:mod}`tdda.referencetest` library is used to support
  the creation of *reference tests*, based on either `unittest` or `pytest`.

  - *Semantic Equivalence.* Reference tests allow checking outputs
    for semantic equivalence rather than identical objects.
    Tests can be parameterized to specify what is allowed to vary.
  - *File-based Comparisons.* Strings and data frames can be compared
    against reference outputs in files. When tests fail, the in-memory
    data structures are written to file and suitable `diff` commands
    are proposed.
  - *Automatic Rewriting of Reference Results.* When behaviour is updated,
    reference results in files can be selectively rewritten after
    verification.
  - *Tagging of Tests.* Tests and test classes can be tagged, and
    run selectively without naming them on the command line. The list
    of failing tests from `tdda.referencetest` runs can also be logged
    automatically, and used to tag failing tests for selective re-running.

* :::{image} image/gentest.png
  :width: 150px
  :align: right
  :alt: A machine with an input hopper with a label "INSERT CODE HERE" and an output chute with a label "COLLECT TESTS HERE". The image is mostly in red with gears in the body of the machine.
  :::

  The [`tdda gentest`](gentest) utility uses {py:mod}`tdda.referencetest`
  to generate tests for any command that can be run from the command line
  (scripts and programs in any language). These tests can check
  a variety of kinds of output, including output to `stdout`, to `stderr`,
  and to files, and can cope with some variation in outputs
  (*semantic* correctness).

  ~~.~~

* :::{image} image/discover-verify.png
  :width: 150px
  :align: right
  :alt: A machine with an input hopper with a label "INSERT DATA HERE" and an output chute with a label "COLLECT CONSTRAINTS AND VERIFICATIONS HERE". The image is mostly in blue with gears in the body of the machine.
  :::

  The {py:mod}`tdda.constraints` library and command-line tools are used to

  - *discover* constraints from a (Pandas) DataFrame,
    and write them out as JSON
    ([`tdda discover`](cli.md#tdda-discover) command).
  - *verify* that datasets meet the constraints in the constraints file
    ([`tdda verify`](cli.md#tdda-verify) command).
  - *detect* individual records/values that fail to meet the constraints
    ([`tdda detect`](cli.md#tdda-detect) command).

  As well as data frames in Parquet files and flat (CSV) files,
  the library supports tables in a variety of relational databases
  without extraction.

* :::{image} image/rexpy.png
  :width: 150px
  :align: right
  :alt: A machine with an input hopper with a label "INSERT STRINGS HERE" and an output chute with a label "COLLECT REGEX HERE". The image is mostly in green with gears in the body of the machine.
  :::

  The [`tdda.rexpy`](#tdda.rexpy.rexpy) library is a tool for
  automatically inferring regular expressions from a column in a
  Pandas DataFrame or from a (Python) list of examples.
  There is also a command-line utility for Rexpy
  ([`rexpy`](cli.md#rexpy) command).

  ~~.~~

  ~~.~~

* :::{image} image/tddadiff.png
  :width: 150px
  :align: right
  :alt: A table showing four columns in two pairs for the same field in two datasets. There are four data rows. Where the data is the same, the values are white. Where they differ, the values from the first dataset are blue and those from the second dataset are red. The background is black.
  :::

  The [`tdda diff`](cli.md#tdda-diff) tool can compare data frames
  in Parquet files and/or flat files and report differences
  in a visual format, somewhat like a visual diff for text files.
  Command-line options and other specifiers control what differences
  are reported, and how.

  ~~.~~

  ~~.~~

* :::{image} image/tddaserial.png
  :width: 150px
  :align: right
  :alt: The outline of a JSON object with text replaced with simple bars as placeholders. Teal background with white text.
  :::

  The [`tdda.serial`](cli.md#tdda-serial) format allows CSV and other
  flat-file formats to be documented in a companion `.serial` file,
  for more accurate and portable reading and writing of flat files.
  The [`tdda serial`](cli.md#tdda-serial) command provides
  functionality for reading and writing `tdda.serial` metadata,
  as well as for reading and writing [CSVW](https://csvw.org) and
  [Frictionless](https://frictionlessdata.io) metadata,
  and for converting between these various metadata formats.

* The [`tdda` utility functions](utils-api.md) provide tools for
  normalization and measurement of Unicode text. This includes:

  - [*Normal Form TK.*](utils-api.md#string-normalization)
    Normalization functions for Normal Form TK, an extension of
    Unicode Compatibility Normalization (normal forms KC and KD).
  - [*Glyph counting*](utils-api.md#unicode-glyph-counting)
    (as opposed to character counting) in strings.
  - [*RFC 9839 support.*](utils-api.md#rfc-9839-character-handling)
    Implementation of RFC 9839's recommendations for removal or
    replacement of deprecated Unicode characters in input data.

Although the library is provided as a Python package, and can be called
through its Python API, it also provides command-line tools, which are
installed when the package is installed.
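
To make the idea of checking outputs for *semantic equivalence* (described under {py:mod}`tdda.referencetest` above) more concrete, here is a minimal sketch in plain Python. It is not the `tdda` implementation and does not use the `tdda` API: the function `strings_equivalent` and its `ignore_patterns` parameter are hypothetical names chosen for illustration. The sketch masks the parts of each line that are allowed to vary (here, a date) before comparing actual output with a reference.

```python
import re


def strings_equivalent(actual, expected, ignore_patterns=()):
    """Compare two multi-line strings line by line.

    Hypothetical helper (not part of tdda): any text matching one of
    the regular expressions in ignore_patterns is masked in both
    strings before comparison, so those parts are allowed to vary.
    """
    def mask(line):
        for pattern in ignore_patterns:
            line = re.sub(pattern, "<IGNORED>", line)
        return line

    actual_lines = [mask(line) for line in actual.splitlines()]
    expected_lines = [mask(line) for line in expected.splitlines()]
    return actual_lines == expected_lines


# A stored reference output and a fresh output that differs only in its date.
reference = "Report generated 2024-01-01\nTotal: 42\n"
output = "Report generated 2025-06-30\nTotal: 42\n"
date_pattern = r"\d{4}-\d{2}-\d{2}"

# An exact comparison fails because the dates differ.
exact_match = strings_equivalent(output, reference)

# Allowing dates to vary makes the two outputs equivalent.
semantic_match = strings_equivalent(
    output, reference, ignore_patterns=[date_pattern]
)
```

In this sketch `exact_match` is `False` and `semantic_match` is `True`. A genuine difference, such as a changed total, still fails the comparison even with the date pattern ignored.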