Add docx paper

2010-07-13 17:37:13 -05:00 · 2010-07-13 17:37:13 -05:00 · f07ae547e0
parent f9552ab0dd
commit f07ae547e0
3 changed files with 27 additions and 1 deletions
--- a/papers/docx.mdwn
+++ b/papers/docx.mdwn
@@ -0,0 +1,25 @@
+Title: Converting docx to text with unzip and sed
+
+Periodically people email me Microsoft Word files which clearly contain
+only text.  Fortunately, Word now creates OOXML `.docx` files, which
+contain honest-to-goodness UTF-8 text (and lots of XML tags).  This is a
+step up from the `.doc` format, which as near as I could tell needed
+special libraries to penetrate.
+
+`.docx` files are zip archives.  The archived file `word/document.xml`
+contains the text of the document itself and can be extracted with
+`unzip file.docx word/document.xml`.
+
+If you just want to see the text in a `.docx` file, you can strip out all
+the XML tags in `word/document.xml`, turning each closing `</w:p>` tag
+into a paragraph break.  It's surprisingly legible for every `.docx` file
+I've seen so far.  The sed script is `s#</w:p>#\n\n#g;s#<[^>]*>##g`.
+
+I made a shell script called `docx2txt` which has unzip write the file
+to stdout and pipes that into sed running that crazy script.  It
+looks like this:
+
+    #! /bin/sh
+
+    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
+
--- a/papers/index.mdwn
+++ b/papers/index.mdwn
@@ -5,6 +5,7 @@ concept to someone on woozle.  Hopefully other people will find them
 useful, too.

 * [Reply-To Munging Still Considered Harmful](reply-to-still-harmful.html)
+* [Converting .docx files to text using unzip and sed](docx.html)
 * [Introduction to TCP Sockets](sockets.html)
 * [3-Minute HTML Tutorial](html-tutorial.html)
 * [How DNS Works](DNS.html)
--- a/src/photobob/index.mdwn
+++ b/src/photobob/index.mdwn
@@ -1,4 +1,4 @@
-[[!meta title="Photobob: Web photo albums"]]
+Title: Photobob: Web photo albums

 I don't have a lot to say about photobob.  It's the 7th or so photo
 album package I've written, and probably the best.  You just put