# tdda diff: Display Differences in Tabular Data (DataFrames and Flat Files) ## The Problem [`tdda diff`](cli.md#tdda-diff) Addresses [`tdda diff`](cli.md#tdda-diff) aims to be a visual diff for tabular data. To explain: - The traditional Unix/Linux `diff` command (and `git diff`) show lines that are different between two files, usually with either `<` to indicate the left (first) file and `>` to indicate the right (second). - For example: diff bat.txt lamb.txt produces $ diff bat.txt lamb.txt 1,2c1,2 < Mary had a little bat < its wings were black you know --- > Mary had a little lamb > its fleece was white as snow. 4c4 < that bat was sure to go. --- > that lamb was sure to go. when `lamb.txt` contains: Mary had a little lamb its fleece was white as snow. And everywhere that Mary went that lamb was sure to go. and `bat.txt` contains Mary had a little bat its wings were black you know And everywhere that Mary went that bat was sure to go. The output is actually a `patch`-format script with commands that transform the left-hand side into the right-hand side, which is where the numbers come in. The lines with `<` are from `bat.txt` and the ones with `>` are from `lamb.txt`. - ‘Visual’ diff tools are similar but usually show the whole of each file side by side, highlighting the specific parts of lines that are different and allowing the two to be scrolled “in sync” even when there are blocks of lines only in one or the other. They often use colour as well (as does `git diff`). For example, this is the output from the Mac's `opendiff` visual diff tool with the same two files. ![Opendiff Output for bat.txt and lamb.txt](image/opendiff-bat-lamb.png) - If the two files are identical, `diff` produces no output (and exits with code 0), and `opendiff` says "No differences" (and exits with code 0). - [`tdda diff`](cli.md#tdda-diff) uses the functionality for comparing DataFrames in reference testing to provide a capability somewhere between these for datasets stored as either parquet files or flat files such as CSV files. It also exposes some of the extra options provided by referencetest's `assertDataFramesEquivalent` and similar methods. For example, if we use a pair of CSV files that are very similar to the text files we used, with a header line and commas instead of spaces between the words (because all the lines have six comma-separated values, including two blanks) and use [`tdda diff`](cli.md#tdda-diff), we get this: ![TDDA Diff Output for bat.csv and lamb.csv](image/tdda-diff-bat-lamb.png) with `bat.csv` as one,two,three,four,five,six Mary,had,a,little,bat, its,wings,were,black,you,know. And,everywhere,that,Mary,went, that,bat,was,sure,to,go. and `lamb.csv` as one,two,three,four,five,six Mary,had,a,little,lamb, its,fleece,was,white,as,snow. And,everywhere,that,Mary,went, that,lamb,was,sure,to,go. - Notice that the default [`tdda diff`](cli.md#tdda-diff) output: - Starts by summarizing the differences (if any) - Then shows a table with - Only the rows and columns with differences - Different values highlighted in red (left) and green (right), and shared values in the terminal's main colour (e.g. white, here, given the dark background). - Columns interleaved from the two datasets. - It reports the type of the values and uses `<` to denote the left (first) dataset and `>` for the right (second) dataset. - Then exits with code 1 (indicating differences). - Again, if there are no differences, it produces no output and exits with code zero. ## Data Types and Specificity The [`tdda diff`](cli.md#tdda-diff) functionality is new and somewhat experimental, but is powerful. In the case of tables stored in Parquet files, these are just loaded as DataFrames, with types based on Parquet. There are some options about whether to use Pandas or Polars, and in the case of Pandas, which back-end to use, but at least there *is* type information. Flat files, such as CSV files, do not normally carry explicit type information, so loading the data into a DataFrame sometimes requires decisions to be made. There are several choices: 1. If nothing is specified the flat file is loaded using `tdda.serial.csv_to_pandas` (by default), or `tdda.serial.csv_to_polars` if polars is specified with `--polars` or by configuration. TDDA Serial uses `pandas.read_csv` or `polars.read_csv`, changing some default values. 2. If metadata is available describing the format of the flat file, the colon syntax can be used to ask TDDA to use the metadata. Specifying `bat.csv:` will get TDDA to look for a suitable metadata file using naming conventions, most often `@.serial` or `bat-metadata.json` in the same directory. CSVW and Frictionless files can also be used. Specifying `bat.csv:bat-metadata.json` tells TDDA to use the metadata in `bat-metadata.json`. For more detail on this, see [tdda.serial colon format](serialformat.md#tdda-serial-colon-format). The left-hand and right-hand files can be in different format, i.e. comparisons between parquet and flat files are allowed, and also between flat files in quite different formats. Three options are available for controlling how closely types must match (`--loose`, `--medium`, and `--strict`). Similarly, for floating-point comparisons, `--dps N` can be used to specify the level of precision (decimal places). Named fields can be excluded from the comparison with --xfields FIELD1,FIELD2,... or a subset can be explicitly specified for comparison with --fields FIELD1,FIELD2,... If the comparison engine is Pandas (the current default), the backend can be overridden by using `--backend BACKEND` (or `-B BACKEND`) where `BACKEND` is any of `n` for `numpy_nullable` (the default), `a` for `pyarrow`, or `o` for original (non-nullable `ints` etc.) ## Performance and Streaming Currently, the whole of both files have to be read in before the comparison is performed and output is generated after the full comparison is carried out. This means 1. Both files must fit comfortably in memory (together) 2. Very large files with no difference or a single difference may take a long time. 3. Files with huge numbers of differences will generate large output, though `--maxdiffs N` can be used to control this. (By default, there is no limit, i.e. all differences are shown). Future versions may read the files in chunks and stream the output. ## Formatting The output can become wide in the default format if either there are many fields with differences, or fields with long values (particularly string values) with differences. Long field names can also cause this. The `--vertical` (`-V`) flag causes lines from the left and right files to be stacked instead of shown side-by-side. Colours can be controlled with `--colours C` (`-c C`) where c is a hyphen-separated pair of colour names such as `red-blue`, `red-green` etc. The default is `red-green`; this can be changed in the configuration file (see [Configuration](configuration.md)). Monochrome output can be requested with `--mono` (different values in bold, shared values dimmed) or `--bw` (different values in bold, shared values in the terminal's default style). The markers used for left and right datasets can also be chosen with any of --angles the default < and > --LR L: and R: --AE A: and E: (for actual and expected) --pm + and - Alternatively, they can be set explicitly with `--prefixes PREFIXES` where `PREFIXES` is a hyphen-separated pair of prefixes such as `'actual:-ref:'` or `'actual: -ref: '` to have spaces included. ## Joins, Keys, Field Order, and Row Order By default, [`tdda diff`](cli.md#tdda-diff) compares datasets row by row, in order, so if the order is different, or if there are missing rows on one side or the other, this will be reflected in the output. (This is quite different from `diff`.) The `--key FIELDS` flag can be used with a single field or a comma-separated list of fields in `FIELDS` to specify a join key. An outer join will be done on those fields before the comparison is run. The field or fields specified must be a primary key for both datasets, i.e. there must be no duplicates across the keys used. Similarly, the structure is considered important, and if the left and right-hand datasets have different column order, this will be reported and the diff will stop. The ability to specify a sort order for rows or to ignore column order is likely to be added in the future. For example, given the files `a.txt`: b,i,f,s,d False,0,0.5,,1970-01-31T00:00:00 True,1,1.5,a,1999-12-31T23:59:59 and `b.txt` b,i,f,s,d False,0,0.6,,1970-01-31T00:00:00 True,1,1.5,a,1999-12-31T23:59:58 the default [`tdda diff`](cli.md#tdda-diff) would give: ![TDDA Diff Output for a.txt and b.txt](image/tdda-diff-a-b.png) whereas using the `--vertical`, `--mono` and `--AE` would give: ![TDDA Diff Output for a.txt and b.txt with vertical, mono and AE switches](image/tdda-diff-a-b-V-AE.png)