Recent Changes

This Version

3.0 Detect functionality now works with databases
3.0 tdda diff for visually comparing data frames (Parquet & CSV files)
3.0 Partial support for polars data frames
3.0 Extra reporting options for tdda discover and tdda detect (tables, HTML, Text, Markdown, JSON, YAML, TOML)
3.0 tdda.serial: metadata format for describing flat files including:
- Support for writing pandas.read_csv arguments
- Support for writing polars.read_csv arguments
- Support for loading flat files to pandas
- Support for loading flat files to polars
- Support for reading and writing frictionless files
- Support for reading and writing CSVW files
- Experimental support for inferring metadata format
- Experimental support for translating between all tdda serial supported formats
3.0 Remove deprecated .feather support (replaced by parquet).
3.0 Configuration file for controlling some behaviours via .toml file
3.0 Support for grouped regular expressions in rexpy and colouring
3.0 nfkt normalization in tdda.utils
3.0 dict_to_tex function in tdda.utils for test-driven document development (TDDA) with TeX and LaTeX
3.0 tdda tag command and -F/--log-failures flag for tdda.referencetest: running tests with -F writes the fully-qualified names of any failing tests to a file; tdda tag then adds @tag decorators to those tests so they can be re-run in isolation with -1. The -9/--untag flag removes all @tag decorators on tests run.
3.0 allowed and required fields in constraints files: discover now records which fields were present when constraints were generated. verify and detect can then flag fields that are missing (required) or unexpected (not allowed).
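The allowed/required check can be pictured as two set differences. The following is a minimal sketch under that assumption, not tdda's implementation; check_fields and its parameter names are hypothetical.

```python
def check_fields(present, required, allowed):
    """Return (missing, unexpected) field names, each sorted.

    Hypothetical helper: 'required' and 'allowed' stand in for the field
    lists recorded by discover; this is not tdda's own code.
    """
    present = set(present)
    missing = sorted(set(required) - present)    # required but absent
    unexpected = sorted(present - set(allowed))  # present but not allowed
    return missing, unexpected


missing, unexpected = check_fields(
    present=["id", "name", "extra"],
    required=["id", "name", "dob"],
    allowed=["id", "name", "dob"],
)
print(missing)     # ['dob']
print(unexpected)  # ['extra']
```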
3.0 tdda config command for inspecting the active configuration:
- tdda config or tdda config --current shows the effective config
- tdda config --default shows built-in default values
- tdda config --file shows only values set in the config file
- tdda config --annotated annotates output with allowed values
3.0 Improved pytest support for tdda.referencetest: --write-all, --tagged, and -9/--untag flags now work with pytest via a conftest.py module, in addition to the existing unittest-based support.
3.0 Improved encoding detection in tdda.serial: when reading flat files, encoding is now detected using chardet with fallback through UTF-8, UTF-8-SIG, UTF-16, and Latin-1, with automatic promotion from Latin-1 to CP1252 where appropriate.
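A fallback chain of this kind might look roughly like the sketch below. It is an illustration only, not tdda.serial's code: decode_with_fallback is a hypothetical name, the guess argument stands in for whatever a detector such as chardet reports, and gating the UTF-8-SIG and UTF-16 attempts on a byte-order mark is this sketch's own simplification.

```python
UTF8_BOM = b"\xef\xbb\xbf"
UTF16_BOMS = (b"\xff\xfe", b"\xfe\xff")


def decode_with_fallback(data, guess=None):
    """Decode bytes, returning (text, encoding_used).

    Hypothetical helper. 'guess' is an optional encoding name from a
    detector; it is tried first, then the fixed fallbacks.
    """
    candidates = [guess] if guess else []
    # Assumption of this sketch: only try BOM-based codecs when a BOM is present.
    candidates.append("utf-8-sig" if data.startswith(UTF8_BOM) else "utf-8")
    if data.startswith(UTF16_BOMS):
        candidates.append("utf-16")
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # Latin-1 accepts every byte, so it is the last resort. Bytes 0x80-0x9F
    # are control codes in Latin-1 but printable in CP1252, so prefer CP1252
    # when such bytes occur and CP1252 can decode them.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            return data.decode("cp1252"), "cp1252"
        except UnicodeDecodeError:
            pass
    return data.decode("latin-1"), "latin-1"


print(decode_with_fallback("caf\u00e9".encode("utf-8"))[1])  # utf-8
print(decode_with_fallback(b"caf\xe9")[1])                   # latin-1
print(decode_with_fallback(b"\x93hi\x94")[1])                # cp1252
```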
3.0 Some changes to switch names (--write-all to --write-all-records)
3.0 Less over-entitization in HTML output
3.0 Improved packaging support (thanks NSK!)

Previous Versions

2.2 Improvements to parquet file handling.
2.2 Feather file support has not yet been removed (as threatened), but will be shortly, possibly even before 2.3. A deprecation warning is now shown when feather files are used.
2.2 Added parquet files to several of the examples that users get with tdda examples.
2.2 Fixed problem with categorical strings from parquet.
2.2 Now use chardet to figure out (infer/guess) encodings in gentest
2.2 Added partial support for CSVW metadata (for CSV files) and some tests and test data in CSVW format.
2.2 Extended support for writing temporary files when tests fail from strings/text files to dataframes, CSV files and Parquet files. This also means that the dataframe methods can now re-write reference results using -W/--write-all etc.
2.2 Renamed some methods and parameters for DataFrame assertions and comparisons. In particular:
- assertOnDiskDataFrameCorrect replaces assertCSVFileCorrect, with the path name now being ref_path rather than ref_csv. The old method remains, and calls the new method. The new method works with parquet files as well as with CSV files.
- assertOnDiskDataFramesCorrect replaces assertCSVFilesCorrect, with the path name now being ref_paths rather than ref_csvs. The old method remains, and calls the new method. The new method works with parquet files as well as with CSV files.
2.2 Better reporting of differences between data frames when tests fail or comparisons show differences.
2.2 Added experimental tdda diff command for comparing data frames serialized as parquet or CSV files.
2.2 Add rich dependency and use rich to format dataframe diffs.
2.2 Fixed bug in flag parsing that prevented multiple single-character flags from being used separately, rather than combined. (So -1W worked but -1 -W did not.)
2.2 Fixed bug in the metadata written in constraints files. The local and utc times were supposed to be written in ISO 8601 format, but the format string repeated %H instead of using %M. Switched to .isoformat(), accepting its default T separator in the datestamps rather than keeping the space.
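The effect of this kind of format-string slip can be reproduced with the standard library alone. The format string below is a hypothetical reconstruction for illustration, not the one tdda used.

```python
from datetime import datetime

stamp = datetime(2024, 3, 5, 14, 7, 9)

# Hypothetical buggy format: %H appears where %M (minutes) was intended.
buggy = stamp.strftime("%Y-%m-%d %H:%H:%S")
# isoformat() produces ISO 8601 with a 'T' separator by default.
fixed = stamp.isoformat()

print(buggy)  # 2024-03-05 14:14:09
print(fixed)  # 2024-03-05T14:07:09
```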
2.2 Quite a lot of internal refactoring, making parameter and method names more consistent, and better suited to a wider variety of file formats and back-end implementations.
2.1 Upgrade pandas dependency to 2.0 and significantly improve compatibility with Pandas 2.0+.
2.1 Add support for parquet files for input and output data (particularly for constraint generation, verification, and detection). New dependency on pyarrow to support this.
2.1 Deprecate use of .feather files. Support will be removed in a future version, no earlier than 2.2.
2.1 Inference of date formats: the TDDA library now uses its own methods to infer date formats, as Pandas no longer supports this.
2.1 Experimental support for CSV metadata specification files. This is unstable, not fully documented, and subject to change.
2.0.8 and 2.0.9 Fixes to IP address lookup in gentest.
2.0 Addition of Gentest: functionality for automatically generating Python test code for any command-line program.
2.0 Major overhaul of documentation.
- More descriptive documentation
- Better (though incomplete) separation between user code (particularly the command-line utilities tdda gentest, tdda discover, tdda verify, tdda detect and rexpy).
- Add more external links to resources and fix those that had rusted
- Improve the CSS to make the documentation render better on tdda.readthedocs.io
- Adopt a customized version of the readthedocs theme for the documentation everywhere, so that what you see if you build the documentation locally should be more similar to what you see at tdda.readthedocs.io
2.0 Significant changes to the algorithm used by Rexpy. Should now be faster, but potentially more stochastic.
2.0 Rexpy can now generate many different flavours of regular expressions.
2.0 Planned deprecation: we plan to move from using .feather files to .parquet files in the 2.1 release, at which point .feather files will immediately be deprecated.

Older Versions

Reference test exercises added.
Escaping of special characters for regular expressions is now done in a way that is uniform across Python2, Python pre-3.7, and Python 3.7+.
JSON is now generated the same for Python2 and Python3 (no trailing whitespace at the end of lines, and UTF-8-encoded).
Fixed an issue in the previous version with the tdda test command, which self-tests an installation, not working properly.
Added new option flag --interleave for tdda detect. This causes the _ok detection fields to be interleaved with the original fields that they refer to in the resulting detection dataset, rather than all appearing together at the far right hand side. This option was actually present in the previous release, but not sufficiently documented.
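The difference between the two column orderings can be sketched as follows. This is illustrative only: order_columns is a hypothetical helper, and the a_ok/b_ok names are placeholders rather than the exact detection field names tdda writes.

```python
def order_columns(fields, ok_columns, interleave=False):
    """Return output column order for original fields plus their _ok columns.

    Hypothetical helper: 'ok_columns' maps each original field name to the
    list of detection columns that refer to it.
    """
    if not interleave:
        # Default: all detection columns together at the far right.
        tail = [c for f in fields for c in ok_columns.get(f, [])]
        return list(fields) + tail
    ordered = []
    for f in fields:
        ordered.append(f)
        ordered.extend(ok_columns.get(f, []))  # directly after their field
    return ordered


fields = ["a", "b"]
ok_columns = {"a": ["a_ok"], "b": ["b_ok"]}
print(order_columns(fields, ok_columns))
# ['a', 'b', 'a_ok', 'b_ok']
print(order_columns(fields, ok_columns, interleave=True))
# ['a', 'a_ok', 'b', 'b_ok']
```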
Fix for the --write-all parameter for tdda.referencetest result regeneration, which had regressed slightly in the previous version.
Improved reporting of differences for text files in tdda.referencetest when the actual results do not match the expected file contents. Now fully takes account of the ignore and remove parameters.
The ignore_patterns parameter in assertTextFileCorrect() (and others) in tdda.referencetest now causes only the portion of a line that matches the regular expressions to be ignored; anything else on the line (before or after the part that matches a regular expression) must be identical in the actual and expected results. This means that you are specifying the part of the line that is allowed to differ, rather than marking an entire line to be ignored. This is a change in functionality, but is what had always been intended. For fuller control (and to get the previous behaviour), you can anchor the expressions with ^.*(...).*$, and then they will apply to the entire line.
The ignore_patterns parameter in tdda.referencetest can now accept grouped subexpressions in regular expressions. This allows use of alternations, which were previously not supported.
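One way to picture the ignore_patterns semantics described in the two entries above is to mask every match in both lines and compare what is left. The sketch below takes that approach; it is an assumption about the behaviour, not tdda.referencetest's implementation, and lines_match is a hypothetical helper.

```python
import re

MASK = "\0"  # placeholder substituted for every ignored match


def lines_match(actual, expected, ignore_patterns):
    """True if the lines are identical apart from text matching a pattern."""
    def mask(line):
        for pattern in ignore_patterns:
            line = re.sub(pattern, lambda m: MASK, line)
        return line
    return mask(actual) == mask(expected)


date = r"\d{4}-\d{2}-\d{2}"
expected = "Run on 2023-01-01 by alice"

# Only the date may differ; the rest of the line must be identical.
same_rest = lines_match("Run on 2024-03-05 by alice", expected, [date])
diff_rest = lines_match("Run on 2024-03-05 by bob", expected, [date])
# Anchoring with ^.*(...).*$ makes the pattern cover the entire line.
whole_line = lines_match("Run on 2024-03-05 by bob", expected,
                         [r"^.*(" + date + r").*$"])
# A grouped alternation lets either name appear.
alternation = lines_match("Run on 2023-01-01 by bob", expected,
                          [r"(alice|bob)"])

print(same_rest, diff_rest, whole_line, alternation)
# True False True True
```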
The ignore_substrings parameter in assertTextFileCorrect() (and others) in tdda.referencetest now only matches lines in the expected file (where you have full control over what will appear there), not in the actual file. This fixes a problem with differences being masked (and not reported as problems) if the actual happened to include unexpected matching content on lines other than where intended.
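The expected-side-only rule for ignore_substrings can be sketched like this. It is a simplified per-line illustration under stated assumptions, not the library's code; line_ok is a hypothetical helper.

```python
def line_ok(actual, expected, ignore_substrings):
    """Compare one pair of lines, consulting only the expected line
    when deciding whether the pair should be ignored."""
    if any(s in expected for s in ignore_substrings):
        return True  # the expected line marks this position as ignorable
    return actual == expected


ignore = ["Copyright"]

# The expected line contains the substring, so the difference is ignored.
print(line_ok("Copyright 2024", "Copyright 2020", ignore))        # True
# The substring appears only in the actual line, so it masks nothing.
print(line_ok("Copyright total: 11", "total: 10", ignore))        # False
```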
The tdda.constraints package is now more resilient against unexpected type mismatches. Previously, if the type didn’t match, then in some circumstances exceptions would be (incorrectly) raised for other constraints, rather than failures.
The tdda.constraints package now supports Python datetime.date fields in Pandas DataFrames, in addition to the existing support of datetime.datetime.
The tdda.constraints Python API now provides support for in-memory constraints, by allowing Python dictionaries to be passed in to verify_df and detect_df, as an alternative to passing in a .tdda filename. This allows an application using the library to store its constraints however it wants to, rather than having to use the filesystem (e.g. storing it online and fetching with an HTTP GET).
The tdda.constraints package can now access MySQL databases using the mysql.connector driver, in addition to the MySQLdb and mysqlclient drivers.
The tdda.rexpy tool can now quote the regular expressions it produces, with the new --quote option flag. This makes it easier to copy the expressions to use them on the command line, or embed them in strings in many programming languages.
The Python API now allows you to import tdda and then refer to its subpackages via tdda.referencetest, tdda.constraints or tdda.rexpy. Previously you had to explicitly import each submodule separately.