Test-Driven Data Analysis (Python TDDA library)

Version 3.0.01. (Installation)

The TDDA module helps with the testing of data and of code that manipulates data.

The major components of the TDDA module are:

Top Line: Machines illustrating the constraint discovering functionality, which takes data in and produces constraints as output; rexpy, which takes strings in and produces regular expressions as output, and gentest, which takes code in and produces tests as output. Bottom Line: tdda diff (data-frame comparison), tdda.serial (metadata for CSV and other flat files), and tdda.ReferenceTest, for semantic testing of complex results.
  • Data Validation and Automatic Constraint Generation: The package includes command-line tools and API calls for

    • discovery of constraints that are satisfied by (example) data — tdda discover;

    • verification that a dataset satisfies a set of constraints. The constraints can have been generated automatically, constructed manually, or (most commonly) consist of generated constraints that have been subsequently refined by hand — tdda verify;

    • detection of records, fields and values that fail to satisify constraints (anomaly detection) — tdda detect.

    Supported data sources include parquet files, database tables, and flat (CSV) files.

  • Reference Testing: The TDDA library offers extensions to unittest and pytest for managing the testing of data analysis pipelines, where the results are typically much larger, and more complex, and more variable than for many other sorts of programs.

  • Inference of Regular Expressions from Examples: There is a command-line tool (and API) for automatically inferring regular expressions from (structured) textual data — rexpy. This was developed as part of constraint generation, but has broader utility.

  • Automatic Test Generation: The TDDA library includes the ability to generate tests for almost any command-line based program or script. The code to be tested can take the form of a shell script or any other command-line code, and can be written in any language or mix of languages.

  • Metadata tools for Flat Files: The tdda.serial modules and tdda serial command assist with more reliable reading and writing of flat files (CSV etc.) using metadata, both in tdda‘s own format, and with the CSVW (.csvw) and Frictionless formats.

  • DataFrame Diff utilities: A new tdda diff utility allows parameterized difference detection between data frames stored in parquet files and flat file.

The tdda library serves as a concrete implementation of the ideas discussed in:

When installed, the module offers a suite of command-line tools that can be used with data from any source, not just Python. It also provides enhanced test methods for Python code, and the Gentest functionality enables automatic generation of test programs for arbitrary code (not just Python code). There is also a full Python API for all functionality.

Test-driven data analysis is closely related to reproducible research, but with more of a focus on automated testing. It is best seen as overlapping and partly complementary to reproducible research.

Contents

Resources