Recent Changes
This Version
3.0 Detect functionality now works with databases
3.0 tdda diff for visually comparing data frames (Parquet & CSV files)
3.0 Partial support for polars data frames
3.0 Extra reporting options for tdda discover and tdda detect (tables, HTML, Text, Markdown, JSON, YAML, TOML)
3.0 tdda.serial: metadata format for describing flat files including:
Support for writing pandas.read_csv arguments
Support for writing polars.read_csv arguments
Support for loading flat files to pandas
Support for loading flat files to polars
Support for reading and writing frictionless files
Support for reading and writing CSVW files
Experimental support for inferring metadata format
Experimental support for translating between all tdda serial supported formats
3.0 Remove deprecated .feather support (replaced by parquet).
3.0 Configuration file for controlling some behaviours via .toml file
3.0 Support for grouped regular expressions in rexpy and colouring
3.0 nfkt normalization in tdda.utils
3.0 dict_to_tex function in tdda.utils for test-driven document development (TDDA) with TeX and LaTeX
3.0 tdda tag command and -F/--log-failures flag for tdda.referencetest: running tests with -F writes the fully-qualified names of any failing tests to a file; tdda tag then adds @tag decorators to those tests so they can be re-run in isolation with -1. The -9/--untag flag removes all @tag decorators on tests run.
3.0 allowed and required fields in constraints files: discover now records which fields were present when constraints were generated. verify and detect can then flag fields that are missing (required) or unexpected (not allowed).
3.0 tdda config command for inspecting the active configuration:
tdda config or tdda config --current shows the effective config
tdda config --default shows built-in default values
tdda config --file shows only values set in the config file
tdda config --annotated annotates output with allowed values
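As a rough illustration of the allowed and required fields item above, the check amounts to two comparisons between the fields present in the data and the lists recorded at discovery time. The function name check_fields and its arguments are hypothetical, chosen for this sketch; they are not part of the tdda API, and the layout of these lists inside a constraints file is not shown here.

```python
def check_fields(present, required=(), allowed=None):
    """Return (missing, unexpected) field names.

    Hypothetical helper for illustration only; not a tdda function.
    """
    present = list(present)
    # Required fields that the data does not contain.
    missing = [name for name in required if name not in present]
    # Fields in the data that are not on the allowed list (if one is given).
    unexpected = ([] if allowed is None
                  else [name for name in present if name not in allowed])
    return missing, unexpected


missing, unexpected = check_fields(
    present=["id", "name", "debug_flag"],
    required=["id", "name", "created"],
    allowed=["id", "name", "created"],
)
print(missing)     # ['created']
print(unexpected)  # ['debug_flag']
```

Passing allowed=None in this sketch means no allowed list was recorded, so no field is reported as unexpected.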
3.0 Improved pytest support for tdda.referencetest: --write-all, --tagged, and -9/--untag flags now work with pytest via a conftest.py module, in addition to the existing unittest-based support.
3.0 Improved encoding detection in tdda.serial: when reading flat files, encoding is now detected using chardet with fallback through UTF-8, UTF-8-SIG, UTF-16, and Latin-1, with automatic promotion from Latin-1 to CP1252 where appropriate.
3.0 Some changes to switch names (--write-all to --write-all-records)
3.0 Less over-entitization in HTML output
3.0 Improved packaging support (thanks NSK!)
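The encoding fallback mentioned in the tdda.serial item above can be pictured with a short standard-library sketch. This is not tdda's implementation: the chardet step is left out so the example runs without extra dependencies, the function name decode_with_fallback is hypothetical, and the rules used here (UTF-8-SIG and UTF-16 only when a byte-order mark is present, CP1252 tried only when bytes in the 0x80 to 0x9F range appear) are assumptions made for illustration.

```python
import codecs


def decode_with_fallback(data: bytes):
    """Return (text, encoding_name) for raw file bytes.

    Hypothetical helper for illustration only; not a tdda function.
    """
    candidates = []
    # A byte-order mark is strong evidence, so those encodings go first.
    if data.startswith(codecs.BOM_UTF8):
        candidates.append("utf-8-sig")
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidates.append("utf-16")
    candidates.append("utf-8")
    # Bytes 0x80-0x9F are control codes in Latin-1 but mostly printable
    # characters (curly quotes, dashes, euro sign) in CP1252.
    if any(0x80 <= byte <= 0x9F for byte in data):
        candidates.append("cp1252")
    # Latin-1 maps every byte to a character, so it never fails.
    candidates.append("latin-1")

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 always decodes")


print(decode_with_fallback("café".encode("utf-8")))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))              # ('café', 'latin-1')
print(decode_with_fallback(b"\x93hi\x94"))           # ('“hi”', 'cp1252')
```

Latin-1 sits last because it accepts any byte sequence; placing it earlier would hide every other candidate.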
Previous Versions
2.2 Improvements to parquet file handling.
2.2 Feather file support has not (as threatened) been removed yet, but will be shortly, possibly even before 2.3. A deprecation warning is now shown when feather files are used.
2.2 Added parquet files to various of the examples that users get with tdda examples
2.2 Fixed problem with categorical strings from parquet.
2.2 Now use chardet to figure out (infer/guess) encodings in gentest
2.2 Added partial support for CSVW metadata (for CSV files) and some tests and test data in CSVW format.
2.2 Extended support for writing temporary files when tests fail from strings/text files to dataframes, CSV files and Parquet files. This also means that the dataframe methods can now re-write reference results using -W/--write-all etc.
2.2 Renamed some methods and parameters for DataFrame assertions and comparisons. In particular:
assertOnDiskDataFrameCorrect replaces assertCSVFileCorrect, with the path name now being ref_path rather than ref_csv. The old method remains, and calls the new method. The new method works with parquet files as well as with CSV files.
assertOnDiskDataFramesCorrect replaces assertCSVFilesCorrect, with the path name now being ref_paths rather than ref_csvs. The old method remains, and calls the new method. The new method works with parquet files as well as with CSV files.
2.2 Better reporting of differences between data frames when tests fail or comparisons show differences.
2.2 Added experimental tdda diff command for comparing data frames serialized as parquet or CSV files.
2.2 Add rich dependency and use rich to format dataframe diffs.
2.2 Fixed bug in flag parsing that prevented multiple single-character flags from being used separately, rather than combined. (So -1W worked but -1 -W did not.)
2.2 Fixed bug in the metadata written in constraints files. The local and UTC times were supposed to be written in ISO8601 format, but the format string repeated %H instead of using %M. Switched to use .isoformat(), and accepted its default T separator in the datestamps, rather than sticking with space.
2.2 Quite a lot of internal refactoring, making parameter and method names more consistent, and better suited to a wider variety of file formats and back-end implementations.
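The timestamp bug noted in the 2.2 items above is easy to reproduce with the standard library. The exact format string tdda used is not given here, so the one below is an assumed reconstruction that shows only the effect of repeating %H where %M belongs.

```python
from datetime import datetime

stamp = datetime(2024, 1, 2, 3, 4, 5)

# Assumed shape of the faulty format string: %H appears twice, %M never.
buggy = stamp.strftime("%Y-%m-%d %H:%H:%S")
# isoformat() emits hours, minutes and seconds, joined to the date by "T".
fixed = stamp.isoformat()
# The space separator remains available through the sep argument.
spaced = stamp.isoformat(sep=" ")

print(buggy)   # 2024-01-02 03:03:05
print(fixed)   # 2024-01-02T03:04:05
print(spaced)  # 2024-01-02 03:04:05
```

The buggy form still looks like a plausible time, which is why an error of this kind can go unnoticed: the hour is simply printed where the minutes should be.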
2.1 Upgrade pandas dependency to 2.0 and significantly improve compatibility with Pandas 2.0+.
2.1 Add support for parquet files for input and output data (particularly for constraint generation, verification, and detection). New dependency on pyarrow to support this.
2.1 Deprecate use of .feather files. Support will be removed in a future version, no earlier than 2.2.
2.1 Inference of date formats: the TDDA library now uses its own methods to infer date formats, as Pandas no longer supports this.
2.1 Experimental support for CSV metadata specification files. This is unstable, not fully documented, and subject to change.
2.0.8 and 2.0.9 Fixes to IP address lookup in gentest.
2.0 Addition of Gentest: functionality for automatically generating Python test code for any command-line program
2.0 Major overhaul of documentation.
More descriptive documentation
Better (though incomplete) separation between user code (particularly the command-line utilities tdda gentest, tdda discover, tdda verify, tdda detect and rexpy).
Add more external links to resources and fix those that had rusted
Improve the CSS to make the documentation render better on tdda.readthedocs.io
Adopt a customized version of the readthedocs theme for the documentation everywhere, so that what you see if you build the documentation locally should be more similar to what you see at tdda.readthedocs.io
2.0 Significant changes to the algorithm used by Rexpy. Should now be faster, but potentially more stochastic.
2.0 Rexpy can now generate many different flavours of regular expressions.
2.0 Planned Deprecation: We plan to move from using .feather files to .parquet files in the 2.1 release, at which point .feather files will immediately be deprecated.
Older Versions
Reference test exercises added.
Escaping of special characters for regular expressions is now done in a way that is uniform across Python2, Python pre-3.7, and Python 3.7+.
JSON is now generated the same for Python2 and Python3 (no blank lines at the end of lines, and UTF8-encoded).
Fixed issue with tdda test command not working properly in the previous version, to self-test an installation.
Added new option flag --interleave for tdda detect. This causes the _ok detection fields to be interleaved with the original fields that they refer to in the resulting detection dataset, rather than all appearing together at the far right hand side. This option was actually present in the previous release, but not sufficiently documented.
Fix for the --write-all parameter for tdda.referencetest result regeneration, which had regressed slightly in the previous version.
Improved reporting of differences for text files in tdda.referencetest when the actual results do not match the expected file contents. Now fully takes account of the ignore and remove parameters.
The ignore_patterns parameter in assertTextFileCorrect() (and others) in tdda.referencetest now causes only the portion of a line that matches the regular expressions to be ignored; anything else on the line (before or after the part that matches a regular expression) must be identical in the actual and expected results. This means that you are specifying the part of the line that is allowed to differ, rather than marking an entire line to be ignored. This is a change in functionality, but is what had always been intended. For fuller control (and to get the previous behaviour), you can anchor the expressions with ^.*(...).*$, and then they will apply to the entire line.
The ignore_patterns parameter in tdda.referencetest can now accept grouped subexpressions in regular expressions. This allows use of alternations, which were previously not supported.
The ignore_substrings parameter in assertTextFileCorrect() (and others) in tdda.referencetest now only matches lines in the expected file (where you have full control over what will appear there), not in the actual file. This fixes a problem with differences being masked (and not reported as problems) if the actual happened to include unexpected matching content on lines other than where intended.
The tdda.constraints package is now more resilient against unexpected type mismatches. Previously, if the type didn’t match, then in some circumstances exceptions would be (incorrectly) raised for other constraints, rather than failures.
The tdda.constraints package now supports Python datetime.date fields in Pandas DataFrames, in addition to the existing support of datetime.datetime.
The tdda.constraints Python API now provides support for in-memory constraints, by allowing Python dictionaries to be passed in to verify_df and detect_df, as an alternative to passing in a .tdda filename. This allows an application using the library to store its constraints however it wants to, rather than having to use the filesystem (e.g. storing it online and fetching with an HTTP GET).
The tdda.constraints package can now access MySQL databases using the mysql.connector driver, in addition to the MySQLdb and mysqlclient drivers.
The tdda.rexpy tool can now quote the regular expressions it produces, with the new --quote option flag. This makes it easier to copy the expressions to use them on the command line, or embed them in strings in many programming languages.
The Python API now allows you to import tdda and then refer to its subpackages via tdda.referencetest, tdda.constraints or tdda.rexpy. Previously you had to explicitly import each submodule separately.
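The ignore_patterns behaviour described above (only the matching portion of a line may differ, and everything around it must be identical) can be modelled in a few lines using the standard re module. This is a simplified model rather than tdda's code: lines_match and PLACEHOLDER are hypothetical names, and the sketch compares one pair of lines instead of whole files.

```python
import re

PLACEHOLDER = "<IGNORED>"


def lines_match(actual, expected, ignore_patterns=()):
    """Compare two lines, ignoring only the parts that match a pattern.

    Hypothetical helper for illustration only; not a tdda function.
    """
    def mask(line):
        # Replace each matching portion with a fixed placeholder, so that
        # text before and after the match still takes part in the comparison.
        for pattern in ignore_patterns:
            line = re.sub(pattern, PLACEHOLDER, line)
        return line

    return mask(actual) == mask(expected)


date = r"\d{4}-\d{2}-\d{2}"
actual = "Run on 2024-01-02 by bob"
expected = "Run on 2023-12-31 by alice"

# Only the date may differ, so the differing user name is still detected.
print(lines_match("Run on 2024-01-02 by alice", expected, [date]))  # True
print(lines_match(actual, expected, [date]))                        # False

# Anchoring with ^.*(...).*$ makes the pattern cover the entire line.
whole_line = r"^.*(" + date + r").*$"
print(lines_match(actual, expected, [whole_line]))                  # True
```

The last call mirrors the anchoring advice in the text: once the pattern spans the whole line, any line containing a date is treated as matching, which corresponds to the earlier whole-line behaviour.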