Git-diffing Excel files

Posted on Fri 04 March 2016 in Tools

In this post, I’m going to demonstrate how to get a useful diff for binary Excel files that reside in a Git repository. I assume that you’re familiar with Git, and in that case you know that having binary files in a repository is generally not a good idea. Nonetheless, sometimes it’s useful—perhaps they contain test data, or import data, or base data for bootstrapping, or some other useful data that aren’t suitable for text form.

Diffing a binary file

For whatever reason you have an Excel file in your repository, finding a difference in the file between two commits is by default not possible. Let’s say that you have uncommitted changes to an Excel file called document.xlsx. This is what happens if you try to see what has changed:

$ git diff document.xlsx
diff --git a/document.xlsx b/document.xlsx
index b22099f..ebebbb9 100644
Binary files a/document.xlsx and b/document.xlsx differ

Since the file is binary rather than text, Git gives up and just prints that there is a difference. But that will soon change! :-)

Defining an Excel diff driver

First, download the ExcelCompare tool. Click on Releases and get the latest one—I used 0.5.1 when testing. Unpack the ZIP file in a folder of your choosing, but make sure it’s in a persistent location (i.e., unpacking it in a Downloads folder is probably a bad idea). I unpacked to c:\apps\excelcompare.

By now, it’s probably a good time to mention that I use Git Bash on Windows. That affects the path handling a bit, as you’ll see.

We will write a script to perform the actual diffing. To begin, edit the .git/config file in your repository and amend the following:

[diff "excel"]
  command = c:/apps/excelcompare/exceldiff.cmd

This registers a diff driver in your repository, telling Git which command to use when diffing files of excel type. The exceldiff.cmd script doesn’t exist yet, but it will in a minute. If you want to place the script somewhere else, or name it differently, feel free to do so.

Do you see the path oddity? The command path looks almost like a regular Windows path, except it has forward slashes instead of backward slashes. Using backward slashes, even if they are properly escaped, did not work for me.

Next, we must tell Git to associate certain files with the excel diff driver. To do so, edit .gitattributes in the root of your repository and amend the following:

*.xlsx diff=excel

Feel free to add more file patterns (one per line) if you wish. ExcelCompare supports a number of different Excel file types.

Note that .gitattributes (unlike .gitignore) must be committed to have any effect, so add it to a commit at this point.

Finally, open up an editor and create exceldiff.cmd in the correct location (as specified in .git/config):

@echo off
c:\apps\excelcompare\excel_cmp.bat %2 %5

(Adjust the path to excel_cmp.bat so that it matches the location where you unpacked ExcelCompare.)

Why arguments two and five? It’s because Git passes seven arguments to a diff command, of which the second is the path to the old file (which likely is a temporary file) and the fifth is the path to the new file. For more detailed information, have a look at the Git documentation, specifically under GIT_EXTERNAL_DIFF.

Back to diffing

Once the script has been saved, we can attempt the diff again:

$ git diff document.xlsx
DIFF  Cell at     Sheet1!A3 => 'X' v/s 'Y'
----------------- DIFF -------------------
Sheets: [Sheet1]
Rows: [3]
Cols: [A]
----------------- EXTRA WB1 -------------------
Sheets: []
Rows: []
Cols: []
----------------- EXTRA WB2 -------------------
Sheets: []
Rows: []
Cols: []
-----------------------------------------
Excel files C:\...\AppData\Local\Temp\TRAkga_document.xlsx and document.xlsx differ

This is the raw output from ExcelCompare. As can be seen, the uncommitted change is apparently that I changed the value of cell A3 on sheet Sheet1 from X to Y (which happens to be exactly what I did).

And that’s it! Much, much more useful than before!

A small note

As can be seen from the example above, the diff output doesn’t have the same format as a “native” textual Git diff, which may look like this:

$ git diff foo
diff --git a/foo b/foo
index 9972d7e..8320b47 100644
--- a/foo
+++ b/foo
@@ -1,2 +1,2 @@
-bar
+baz

Specifically, the initial metadata lines are not present, and the differences are not in patch format. This means that an Excel diff cannot be used for patching an Excel file. If you attempt to use git format-patch, you’ll see that it just ignores the Excel diff.

Summary

In this post, I have shown how to configure a Git repository to use a custom command for showing the difference between different revisions of an Excel file. If you found it useful, please let me know in the comments!

Update 2016-04-17

  • Fixed some typos.
  • Fixed incorrect .git/config entry.
  • Added a paragraph about committing the .gitattributes files.