I got a question from a colleague a few weeks ago about a potential bug in ExifTool, a fantastic tool and library by Phil Harvey for parsing EXIF data. I had a few minutes this evening, so I thought I’d share the digging.
We made a couple of Word documents and ran them through ExifTool to take a look at the ‘Word Count’ field; I later recreated them at home.
Here are the documents
Here’s the word count according to Word. (Note: different versions of Word may count things differently; I remember seeing an example back in 2009 where versions disagreed about what counts as a word, rightly or wrongly.)
Cool, that matches up with what we’re seeing in the document.
Now let’s run it through ExifTool:
Um… that’s not right?
So I messaged Phil Harvey about it, and he seemed to indicate that it was a problem with Word not putting the correct information into the metadata.
I decided that I should find another tool and tried out MetaDiver, which uses Apache Tika to parse the metadata.
1.docx
2.docx
And that output matches up with ExifTool’s, so it looks like you probably shouldn’t trust the word count (or the character count, for that matter) in the file’s metadata to be accurate.
I also created the first docx as an RTF file and got a different word count again! (4 words…still wrong)
Ok! So my bit about verification:
I ran similar documents through X-Ways (screenshots not provided because I don’t have a copy of X-Ways at home) and saw similar data in there as I did in ExifTool. I reached out to Stefan just because I wanted to make sure that X-Ways isn’t using ExifTool as a library, which it turns out it isn’t.
The point about verification is this: when you check one tool’s output against another’s, it’s worth making sure the two tools aren’t actually using the exact same code to parse the data.
After posting I decided I was being lazy by not actually looking for the word count in the docx file itself.
After opening it with 7-Zip, it’s pretty easy to find: docProps\app.xml shows a word count of 3… which is wrong.
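If you’d rather not click through an archive viewer, the same check can be scripted. This is only a rough sketch: it assumes the docx uses the default (transitional) OOXML namespaces, and the function names stored_word_count and recounted_words are mine, not part of any tool mentioned here. The recount simply splits paragraph text on whitespace, so it won’t necessarily agree with Word’s own counting rules; it’s just a sanity check against the stored value.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespaces assumed here are the default (transitional) OOXML ones.
APP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def stored_word_count(docx):
    """Return the <Words> value saved in docProps/app.xml, or None if absent."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))
    node = root.find(f"{{{APP_NS}}}Words")
    return int(node.text) if node is not None and node.text else None


def recounted_words(docx):
    """Roughly recount words from the text runs in word/document.xml."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    total = 0
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join the runs first so a word split across runs is counted once.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        total += len(text.split())
    return total


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "stored:", stored_word_count(path),
              "recounted:", recounted_words(path))
```

Run it with one or more docx paths as arguments. If the stored and recounted numbers disagree, that’s a hint the metadata was never refreshed before the file was saved.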
Some quick additional testing showed that running word count in Word updates the stored value. So everyone, please run word count before you save and shut down the application. Thanks 🙂