ripperdoc

command module
v0.0.0-...-f67b8da
Published: Aug 9, 2024 License: MIT Imports: 7 Imported by: 0

README

=pod

=head1 Ripperdoc: a tool to quickly grab all the text out of a .docx file

=for html <img src="mascot.jpg" alt="ripperdoc mascot">

- unzip, nab text from xml, dump in outfile (ez perl script)

- document.xml is the main one

- https://gist.github.com/felipeochoa/81d8fa27901e8222c6ffbeb165a85acc

- golang: encoding/xml and archive/zip

- x.docx/word/document.xml, grep for /<w:t>(.*?)<\/w:t>/

- use impreza wiring in archive/manuals for a test

Approximate equivalent of this pwsh line:

    mkdir extract-tmp; expand-archive '.\Impreza Wiring.docx' extract-tmp; cd extract-tmp/word; perl '-Mv5.10' -lne 'say for /<w:t>(.*?)<\/w:t>/g' document.xml
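
For comparison, here is a minimal Go sketch of the same idea using archive/zip and regexp. It is not necessarily how ripperdoc itself is written. The names extractRuns and sampleDocx are made up for illustration, and the sample archive is built in memory so the snippet runs without a real .docx. One assumption beyond the notes: the pattern is widened slightly so it also matches runs that Word writes as <w:t xml:space="preserve">.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"log"
	"regexp"
)

// textRun captures the contents of each <w:t> element.
// Assumption: the optional (?:\s[^>]*)? part also accepts attributes such as
// xml:space="preserve" without matching other tags like <w:tab/> or <w:tbl>.
var textRun = regexp.MustCompile(`(?s)<w:t(?:\s[^>]*)?>(.*?)</w:t>`)

// extractRuns (hypothetical helper) treats docx as a zip archive, reads
// word/document.xml from it, and returns every text run in document order.
func extractRuns(docx []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return nil, fmt.Errorf("not a readable docx: %w", err)
	}
	f, err := zr.Open("word/document.xml")
	if err != nil {
		return nil, fmt.Errorf("no word/document.xml: %w", err)
	}
	defer f.Close()

	data, err := io.ReadAll(f)
	if err != nil {
		return nil, fmt.Errorf("reading word/document.xml: %w", err)
	}

	var runs []string
	for _, m := range textRun.FindAllSubmatch(data, -1) {
		runs = append(runs, string(m[1]))
	}
	return runs, nil
}

// sampleDocx builds a tiny stand-in archive in memory (demo scaffolding only).
func sampleDocx() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		log.Fatal(err)
	}
	const body = `<w:document><w:body><w:p>` +
		`<w:r><w:t>Hello</w:t></w:r>` +
		`<w:r><w:t xml:space="preserve"> world</w:t></w:r>` +
		`</w:p></w:body></w:document>`
	if _, err := io.WriteString(w, body); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}
	return buf.Bytes()
}

func main() {
	runs, err := extractRuns(sampleDocx())
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range runs {
		fmt.Println(r)
	}
}
```

For a real file, the bytes could come from os.ReadFile, or zip.OpenReader could be used on the path directly.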

It'd be better if I actually traversed the XML, since I could make better judgments on document reconstruction. However, I don't actually care.
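
For reference, a traversal could look roughly like the sketch below, which walks document.xml with encoding/xml and emits one string per paragraph (<w:p>). This is only an illustration of the alternative, not what ripperdoc does. The function name paragraphs and the sample XML are hypothetical, and nested paragraphs (for example inside text boxes) are not handled.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// wordNS is the WordprocessingML namespace that the "w:" prefix is bound to.
const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// paragraphs (hypothetical helper) reads document.xml token by token and
// returns one string per <w:p>, concatenating the <w:t> runs inside it.
func paragraphs(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var out []string
	var cur strings.Builder
	inText := false

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Token resolves the "w:" prefix to its namespace URL.
			if t.Name.Space == wordNS && t.Name.Local == "t" {
				inText = true
			}
		case xml.CharData:
			if inText {
				cur.Write(t) // copy now; the decoder reuses its buffer
			}
		case xml.EndElement:
			if t.Name.Space != wordNS {
				continue
			}
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				// End of paragraph: flush the collected runs as one line.
				out = append(out, cur.String())
				cur.Reset()
			}
		}
	}
}

// sample is a made-up fragment with a bold run in the first paragraph.
const sample = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>` +
	`<w:p><w:r><w:t xml:space="preserve">Fuse </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>box</w:t></w:r></w:p>` +
	`<w:p><w:r><w:t>Relay</w:t></w:r></w:p>` +
	`</w:body></w:document>`

func main() {
	ps, err := paragraphs(sample)
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range ps {
		fmt.Println(p)
	}
}
```

Since paragraph boundaries come from the markup itself, formatted runs inside one paragraph end up on the same line without any whitespace guessing.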

=head2 Done

- open docx file, dying if can't

- open word/document.xml or die if it isn't there

- close all of the above and exit

- grep through document.xml for text, dumping it to stdout

- resolve issue with nabbing formatted text

- add file specification

- help function (-h)

- make it automatically run goroutines

- switch to specify output file (-o)

- "smart mode" - reconstruct document (join lines if they start/end with whitespace) (-s)
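
One possible reading of the smart mode rule, sketched in Go: a run is glued onto the previous line when the previous line ends with whitespace or the run itself starts with whitespace. This is an interpretation of the note above rather than the actual implementation, and smartJoin and its helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// startsWithSpace reports whether s begins with a whitespace rune.
func startsWithSpace(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return s != "" && unicode.IsSpace(r)
}

// endsWithSpace reports whether s ends with a whitespace rune.
func endsWithSpace(s string) bool {
	r, _ := utf8.DecodeLastRuneInString(s)
	return s != "" && unicode.IsSpace(r)
}

// smartJoin (hypothetical) merges adjacent runs that look like pieces of the
// same sentence: if the previous line ends with whitespace, or the current
// run starts with whitespace, the run is appended to the previous line.
func smartJoin(runs []string) []string {
	var out []string
	for _, run := range runs {
		if n := len(out); n > 0 && (endsWithSpace(out[n-1]) || startsWithSpace(run)) {
			out[n-1] += run
			continue
		}
		out = append(out, run)
	}
	return out
}

func main() {
	// Made-up runs, as they might come out of a document with mixed formatting.
	runs := []string{"Check the ", "blue", " wire", "Next step"}
	for _, line := range smartJoin(runs) {
		fmt.Println(line)
	}
}
```

With the sample runs, the first three pieces collapse into a single line and the last one stays separate.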

=head2 TODO

- make error handling better

=cut
