diff --git a/papers/docx.mdwn b/papers/docx.mdwn
new file mode 100644
index 0000000..be212a6
--- /dev/null
+++ b/papers/docx.mdwn
@@ -0,0 +1,25 @@
+Title: Converting docx to text with unzip and sed
+
+Periodically people email me Microsoft Word files which clearly contain
+only text. Fortunately, Word is now creating OOXML `.docx` files which
+contain honest to goodness UTF-8 text (and lots of XML tags). This is a
+step up from the `.doc` format which as near as I could tell needed
+special libraries to penetrate.
+
+`.docx` files are zip archives. The archived file `word/document.xml`
+contains the text of the document itself and can be extracted with
+`unzip file.docx word/document.xml`.
+
+If you just want to see the text in a .docx file, you can strip out all
+XML tags of `word/document.xml`, converting the P tag to a new
+paragraph. It's surprisingly legible for every .docx file I've seen so
+far. The sed command would be `s#</w:p>#\n\n#g;s#<[^>]*>##g`.
+
+I made a shell script called `docx2txt` which contains the unzip command
+to pipe to stdout, which is read by sed running that crazy script. It
+looks like this:
+
+    #! /bin/sh
+
+    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
+
diff --git a/papers/index.mdwn b/papers/index.mdwn
index 3c5814a..33ecbd0 100644
--- a/papers/index.mdwn
+++ b/papers/index.mdwn
@@ -5,6 +5,7 @@ concept to someone on woozle. Hopefully other people will find them
 useful, too.
 
 * [Reply-To Munging Still Considered Harmful](reply-to-still-harmful.html)
+* [Converting .docx files to text using unzip and sed](docx.html)
 * [Introduction to TCP Sockets](sockets.html)
 * [3-Minute HTML Tutorial](html-tutorial.html)
 * [How DNS Works](DNS.html)
diff --git a/src/photobob.mdwn b/src/photobob/index.mdwn
similarity index 98%
rename from src/photobob.mdwn
rename to src/photobob/index.mdwn
index 2d2c95e..86810ad 100644
--- a/src/photobob.mdwn
+++ b/src/photobob/index.mdwn
@@ -1,4 +1,4 @@
-[[!meta title="Photobob: Web photo albums"]]
+Title: Photobob: Web photo albums
 
 I don't have a lot to say about photobob. It's the 7th or so photo album
 package I've written, and probably the best. You just put