Add docx paper

Neale Pickett 2010-07-13 17:37:13 -05:00
parent f9552ab0dd
commit f07ae547e0
3 changed files with 27 additions and 1 deletions

papers/docx.mdwn Normal file

@@ -0,0 +1,25 @@
Title: Converting docx to text with unzip and sed

Periodically people email me Microsoft Word files which clearly contain
only text. Fortunately, Word now creates OOXML `.docx` files, which
contain honest-to-goodness UTF-8 text (and lots of XML tags). This is a
step up from the `.doc` format which, as near as I could tell, needed
special libraries to parse.

`.docx` files are zip archives. The archived file `word/document.xml`
contains the text of the document itself and can be extracted with
`unzip file.docx word/document.xml`, which writes it to
`word/document.xml` under the current directory.

If you just want to see the text in a `.docx` file, you can strip all
the XML tags out of `word/document.xml`, first turning each closing
`</w:p>` (paragraph) tag into a paragraph break. The result is
surprisingly legible for every `.docx` file I've seen so far. The sed
command would be `s#</w:p>#\n\n#g;s#<[^>]*>##g` (the `\n` in the
replacement relies on GNU sed).
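
To see what that sed script does, here is a small sketch that runs it on a made-up fragment shaped like `word/document.xml`. The `sample` string and the `strip_tags` helper are illustrative names, not part of the original script, and GNU sed is assumed.

```shell
# A made-up fragment in the shape of word/document.xml (two paragraphs).
sample='<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p>'

# Hypothetical helper wrapping the same sed script as above.
# First expression: each closing paragraph tag becomes two newlines.
# Second expression: every remaining tag is deleted.
strip_tags() {
    sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
}

# Command substitution drops the trailing newlines left by the last </w:p>.
result=$(printf '%s\n' "$sample" | strip_tags)
printf '%s\n' "$result"
```

The order of the two expressions matters: if the tags were stripped first, there would be no `</w:p>` left to turn into paragraph breaks.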

I made a shell script called `docx2txt` which runs unzip with `-qc` to
write that file quietly to stdout and pipes it to sed running that crazy
script. It looks like this:

    #! /bin/sh
    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'


@@ -5,6 +5,7 @@ concept to someone on woozle. Hopefully other people will find them
 useful, too.
 * [Reply-To Munging Still Considered Harmful](reply-to-still-harmful.html)
+* [Converting .docx files to text using unzip and sed](docx.html)
 * [Introduction to TCP Sockets](sockets.html)
 * [3-Minute HTML Tutorial](html-tutorial.html)
 * [How DNS Works](DNS.html)


@@ -1,4 +1,4 @@
-[[!meta title="Photobob: Web photo albums"]]
+Title: Photobob: Web photo albums
 I don't have a lot to say about photobob. It's the 7th or so photo
 album package I've written, and probably the best. You just put