Test-Driven Data Analysis (Python TDDA library)
Version 3.0.04. (Installation)
The TDDA module helps with the testing of data and of code that manipulates data.
The major components of the TDDA module are:
Data Validation and Automatic Constraint Generation: The package includes command-line tools and API calls for
discovery of constraints that are satisfied by (example) data —
tdda discover;verification that a dataset satisfies a set of constraints. The constraints can have been generated automatically, constructed manually, or (most commonly) consist of generated constraints that have been subsequently refined by hand —
tdda verify;detection of records, fields and values that fail to satisify constraints (anomaly detection) —
tdda detect.
Supported data sources include parquet files, database tables, and flat (CSV) files.
Reference Testing: The TDDA library offers extensions to
unittestandpytestfor managing the testing of data analysis pipelines, where the results are typically much larger, and more complex, and more variable than for many other sorts of programs.Inference of Regular Expressions from Examples: There is a command-line tool (and API) for automatically inferring regular expressions from (structured) textual data —
rexpy. This was developed as part of constraint generation, but has broader utility.Automatic Test Generation: The TDDA library includes the ability to generate tests for almost any command-line based program or script. The code to be tested can take the form of a shell script or any other command-line code, and can be written in any language or mix of languages.
Metadata tools for Flat Files: The
tdda.serialmodules andtdda serialcommand assist with more reliable reading and writing of flat files (CSV etc.) using metadata, both intdda‘s own format, and with the CSVW (.csvw) and Frictionless formats.DataFrame Diff utilities: A new
tdda diffutility allows parameterized difference detection between data frames stored in parquet files and flat file.
The tdda library serves as a concrete implementation of the ideas
discussed in:
Test-Driven Data Analysis, by Nicholas J. Radcliffe, CRC Press (book, available from all good booksellers and all sellers of good books).
the Test-Driven Data Analysis blog.
When installed, the module offers a suite of command-line tools that can be used with data from any source, not just Python. It also provides enhanced test methods for Python code, and the Gentest functionality enables automatic generation of test programs for arbitrary code (not just Python code). There is also a full Python API for all functionality.
Test-driven data analysis is closely related to reproducible research, but with more of a focus on automated testing. It is best seen as overlapping and partly complementary to reproducible research.
Contents
- Overview
- Installation
- Data Validation with Constraints
- Gentest: Automatic Test Generation for Unix & Linux Commands/Scripts
- Reference Tests
- Rexpy
- tdda diff: Display Differences in Tabular Data (DataFrames and Flat Files)
tdda.serial: Metadata and Tools for Flat (“CSV”) Files- The
tdda.serialFile Structure - Reading Data Using
tdda.serialFiles and other Metadata Specifications - Writing Data with
tdda.serial(API) - Generating
tdda.serialFiles by Inference from a Flat File - Converting Between
tdda.serial, CSVW, and Frictionless - Custom Sections within
tdda.serialFiles - Using
tdda.serialMetadata in thetddaLibrary
- The
- TDDA's Constraints API
- TDDA's API for Rexpy
- TDDA Serial API
- TDDA Utility Functions
- Command Line Reference
- Configuration
- Microsoft Windows Configuration
- Tests
- Examples
- Recent Changes