Add docx paper
parent f9552ab0dd
commit f07ae547e0
@ -0,0 +1,25 @@
Title: Converting docx to text with unzip and sed

Periodically people email me Microsoft Word files which clearly contain
only text. Fortunately, Word now creates OOXML `.docx` files, which
contain honest-to-goodness UTF-8 text (and lots of XML tags). This is a
step up from the `.doc` format, which as near as I could tell needed
special libraries to read.

`.docx` files are zip archives. The archived file `word/document.xml`
contains the text of the document itself and can be extracted with
`unzip file.docx word/document.xml`.

If you just want to see the text in a `.docx` file, you can strip out all
the XML tags in `word/document.xml`, converting each closing `</w:p>` tag
to a paragraph break. The result is surprisingly legible for every `.docx`
file I've seen so far. The sed script is `s#</w:p>#\n\n#g;s#<[^>]*>##g`.

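One caveat about that sed script: `\n` in the replacement part is understood by GNU sed, but POSIX does not define it, and BSD sed may print a literal `n` instead. The following is a minimal sketch of a more portable variant, assuming the XML arrives on stdin. The function name `strip_docx_xml` and the sample input are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): reads the
# contents of word/document.xml on stdin, writes plain text to stdout.
strip_docx_xml() {
    # A backslash followed by a real newline is the portable way to put
    # a newline into a sed replacement. Two of them end the paragraph
    # and leave a blank line after it. The second expression then
    # deletes every remaining tag.
    sed -e 's#</w:p>#\
\
#g' -e 's#<[^>]*>##g'
}

# A small hand-written sample standing in for a real document.xml.
printf '%s\n' '<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p>' | strip_docx_xml
```

With a real file this would be fed from unzip, for example `unzip -qc file.docx word/document.xml | strip_docx_xml`.
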
I made a shell script called `docx2txt` that has unzip write
`word/document.xml` to stdout and pipes it to sed running that crazy
script. It looks like this:

    #! /bin/sh

    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'

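The script as written assumes it is always handed exactly one readable file. Below is a slightly more defensive sketch, written as a shell function so it can be tried in an interactive shell. The usage message, the error text, and the return codes 1 and 2 are assumptions of this example and do not come from the original script. The unzip and sed invocations are unchanged.

```shell
#!/bin/sh
# Sketch of docx2txt as a function with basic argument checking.
# The messages and return codes are choices made for this example.
docx2txt() {
    if [ "$#" -ne 1 ]; then
        echo "usage: docx2txt file.docx" >&2
        return 2
    fi
    if [ ! -r "$1" ]; then
        echo "docx2txt: cannot read $1" >&2
        return 1
    fi
    # -c sends the named archive member to stdout; -q suppresses
    # unzip's informational output so only the XML reaches sed.
    unzip -qc "$1" word/document.xml | sed 's#</w:p>#\n\n#g;s#<[^>]*>##g'
}
```

A typical call might be `docx2txt letter.docx | less`, where `letter.docx` is a placeholder file name.
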
@ -5,6 +5,7 @@ concept to someone on woozle. Hopefully other people will find them
useful, too.

* [Reply-To Munging Still Considered Harmful](reply-to-still-harmful.html)
* [Converting .docx files to text using unzip and sed](docx.html)
* [Introduction to TCP Sockets](sockets.html)
* [3-Minute HTML Tutorial](html-tutorial.html)
* [How DNS Works](DNS.html)
@ -1,4 +1,4 @@
[[!meta title="Photobob: Web photo albums"]]
Title: Photobob: Web photo albums

I don't have a lot to say about photobob. It's the 7th or so photo
album package I've written, and probably the best. You just put