---
title: Converting docx to text with unzip and sed
section: computing
---

Periodically people email me Microsoft Word files which clearly contain
only text.  Fortunately, Word now creates OOXML `.docx` files, which
contain honest-to-goodness UTF-8 text (and lots of XML tags).  This is a
step up from the `.doc` format, which as near as I could tell needed
special libraries to penetrate.

`.docx` files are zip archives.  The archived file `word/document.xml`
contains the text of the document itself and can be extracted with
`unzip file.docx word/document.xml`, which writes it to
`word/document.xml` under the current directory.

If you just want to see the text in a `.docx` file, you can strip out all
the XML tags in `word/document.xml`, turning each closing paragraph tag
(`</w:p>`) into a paragraph break.  The result is surprisingly legible
for every `.docx` file I've seen so far.  The sed command would be
`s#</w:p>#\n\n#g;s#<[^>]*>##g`.
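
To see what that sed script does without a real document, here is a small sketch run against a made-up fragment shaped like `word/document.xml`. The sample text and the `strip_tags` name are my own illustration, and the `\n` in the replacement assumes GNU sed.

```shell
# A made-up fragment in the shape of word/document.xml: paragraphs are
# wrapped in <w:p>, and the text sits inside <w:r><w:t> runs.
sample='<w:body><w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p><w:p><w:r><w:t xml:space="preserve">Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p></w:body>'

# strip_tags is a hypothetical helper name; it only wraps the sed script.
# First substitution: each closing </w:p> becomes two newlines.
# Second substitution: every remaining <...> tag is deleted.
# Assumption: GNU sed, which reads \n in the replacement as a newline.
strip_tags() {
    sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
}

printf '%s\n' "$sample" | strip_tags
```

With GNU sed this should print the two paragraphs separated by a blank line, with the two runs of the second paragraph joined back into one sentence.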

I made a shell script called `docx2txt` that has unzip write
`word/document.xml` to stdout and pipes it into sed running that crazy
script.  It looks like this:

    #! /bin/sh

    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
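
One thing tag stripping leaves behind is XML escaping: an ampersand in the document comes out as `&amp;`, and a less-than sign as `&lt;`. If that bothers you, a possible extension is sketched below. The `decode_entities` helper is hypothetical and not part of my script, and it only handles the five predefined XML entities, not numeric character references.

```shell
# decode_entities is a hypothetical helper: it turns the XML escapes that
# survive tag stripping back into plain characters.  &amp; is decoded last
# so that "&amp;lt;" ends up as the literal text "&lt;" rather than "<".
decode_entities() {
    sed -e 's#&lt;#<#g' \
        -e 's#&gt;#>#g' \
        -e 's#&quot;#"#g' \
        -e "s#&apos;#'#g" \
        -e 's#&amp;#\&#g'
}

# The same pipeline as the script, written as a function, with the
# decoding step added at the end.  The \n in the replacement still
# assumes GNU sed.
docx2txt() {
    unzip -qc "$1" word/document.xml |
        sed 's#</w:p>#\n\n#g;s#<[^>]*>##g' |
        decode_entities
}
```

Calling `docx2txt file.docx` then behaves like the script, except that escaped characters read normally. The cost is one more sed process, which is the kind of creep the two-line version avoids.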

There are other, probably more powerful, docx to text converters on the
Internet.  The advantage of mine is simplicity, when all you want to do
is read the text and move on with your life.