Print to Digital: Cleaning Up Your Word File

This week's post is by Carla Douglas from Beyond Paper Editing.

Adapted. Originally posted at Beyond Paper Editing in August 2014.

If you have a print document that you’d like to self-publish, you can turn it into a digital file and convert it to an ebook.

The first step is to get it into MS Word using OCR software. Note that MS Word is your best friend right now. Editors use Word for a few reasons, and efficient cleanup and editing are high on that list.

Here’s what the file I’m working with looks like as a pdf (produced on a Macintosh Classic and dot matrix printer):

The manuscript has been marked up with pencil, and these marks are picked up by the OCR software, sometimes in unexpected ways. Here’s what the Word file looks like:

Two Kinds of Cleanup

There’s junk in the file: the stuff you can see, and the stuff you can’t. Sometimes, what’s hidden behind the scenes in Word is the cause of the junk you can see, such as garbled text and wonky formatting. Also, the pencil marks that haven’t been converted to text remain in the document as pictures, and will have to be deleted. Some random characters appear, too, and the text is all boldface. These are just a few of the things you can see.

To clean up this file, will a spritz of vinegar and water do, or will you need industrial-strength degreaser? The answer depends on what you plan to do with the file next. If you’re going to revise or edit the text, clean it up enough to continue working on it, and save the heavy-duty cleanup for later.

For Initial Cleanup

The story I’m working with here is just over 4,000 words, and it won’t be converted to an ebook any time soon. I’m going to do an initial cleanup using FileCleaner from Jack Lyon’s Editorium. (Wiley Publishing has a free Word add-in with many similar features.)

FileCleaner is about US$30, but there’s a generous 45-day free trial available. It runs as a Word plug-in. Follow the directions on the site to download and install it. It will appear on the Add-ins pane in your Word ribbon. Here’s what it will do (you can select/unselect features):

Running FileCleaner cleaned up most of the junk in my story file—it’s now in a format I can continue to edit without too many distractions. Here’s what it looks like post-FileCleaner:

As you can see, FileCleaner didn’t catch the text that had been marked up with pencil. After trying a few ways to clean this up—including selecting the text and applying Normal style to it—I ended up having to repair it manually by deleting the picture and re-keying the sentence that’s squished together. Because my document is short, this wasn’t a problem, but in a longer document it could present a significant inconvenience. Here’s a last look at the cleaned-up text:

Other Cleanup Tips

At times, Word can be frustrating to work in—with extra page breaks and hidden formatting, it will do things you don’t want it to. For now, I’ve cleaned my file up well enough to do further editing.
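Some of this tidying can also be scripted once the text is out of Word. What follows is a rough Python sketch, not FileCleaner’s method: it assumes you’ve saved the document as plain text, and the function name and the particular rules are chosen purely for illustration.

```python
import re


def tidy_plain_text(text):
    """Light cleanup for plain text exported from a word processor."""
    text = text.replace("\r\n", "\n")        # normalize Windows line endings
    text = text.replace("\f", "\n")          # treat page breaks (form feeds) as line breaks
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" +\n", "\n", text)       # drop spaces at the ends of lines
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line in a row
    return text.strip() + "\n"


sample = "The  snow is   \n\n\n\nnearly\tgone.\f\n"
print(tidy_plain_text(sample))
```

A script like this only touches characters. It can’t see Word styles, pictures or boldface, so it’s a complement to cleanup inside Word, not a replacement for it.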

If your Word document is really acting up, there are a few things to try. I’ve found that the best place to start is by using the show/hide feature on the Word ribbon. How to Find the Hidden Formatting That Will Mess Up Your Ebook shows you how.
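Show/hide works by drawing a visible symbol for each character you normally can’t see. The same idea can be sketched in Python for plain text. The symbols below imitate the ones Word commonly displays, but the mapping and the function name are assumptions for illustration, not anything Word itself exposes.

```python
# Map invisible characters to visible stand-ins (an illustrative choice).
VISIBLE_MARKS = {
    " ": "·",                 # ordinary space
    "\u00a0": "°",            # non-breaking space
    "\t": "→",                # tab
    "\n": "¶\n",              # paragraph mark, keeping the line break
    "\f": "[page break]\n",   # form feed, used here to stand for a page break
}


def show_hidden(text):
    """Return text with spaces, tabs and breaks made visible."""
    return "".join(VISIBLE_MARKS.get(ch, ch) for ch in text)


print(show_hidden("Chapter\tOne \n\nIt was  late.\n"))
```

Doubled spaces, stray tabs and empty paragraphs all stand out once they’re printed this way, which is exactly what makes show/hide a good first diagnostic.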

Image by atomicjeep

How to Turn Your Print Book into a Digital File

My grandmother's typewriter: an Underwood Noiseless Portable

by Carla Douglas
@CarlaJDouglas

Adapted. Originally posted at Beyond Paper Editing in August 2014.

OCR—not your grandmother’s typewriter!

You can turn your essays, stories and other documents—stuff you might have lying around in a drawer—into ebooks. You also may have unpublished or previously published books, now out of print, that you want to self-publish as ebooks (be certain you own the rights).

You can do this yourself, but first you need to get this material into a digital format. One way is to re-key the text manually (not really an option if you have a book-length work). Another is to use optical character recognition (OCR) software, which converts a scanned document into a digital file.

There are many OCR programs available, ranging in price from free to fairly costly. I chose OCRonline to experiment with. It’s web-based, and your first 5 page conversions are free. After that, they’re 4 cents per page. Simply open an account and log in, then follow the instructions.
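To estimate what a longer project would cost at these rates, the arithmetic can be sketched in a few lines of Python. The function name and parameters are made up for illustration, the prices are the ones quoted above, and the sketch assumes the free allowance applies once per account.

```python
def ocr_cost_cents(pages, free_pages=5, cents_per_page=4):
    """Estimate the conversion cost in cents (hypothetical helper).

    Assumes the rates quoted in this post: the first 5 pages are free
    and each additional page costs 4 cents.
    """
    if pages < 0:
        raise ValueError("pages must not be negative")
    billable = max(0, pages - free_pages)
    return billable * cents_per_page


# A 200-page manuscript: 195 billable pages at 4 cents each.
cost = ocr_cost_cents(200)
print(f"${cost / 100:.2f}")  # prints $7.80
```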

1. Scan your document and save it as a pdf. The photo at the top of the page? That’s my grandmother’s typewriter. She was a prolific correspondent, and I’m currently digitizing a collection of her letters. Here’s a snippet of one, dated March 13th, 1944:

Tip: Be sure to scan all pages into a single document, or you’ll be stuck (as I was) with multiple separate files that have to be compiled later.
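If you do end up with one file per page, the files have to go back together in page order, and a plain alphabetical sort puts page 10 ahead of page 2. Here is a small Python sketch of a "natural" sort that avoids this. The file names are invented for the example, and the function only works out the order; it doesn’t merge anything.

```python
import re


def page_order_key(filename):
    """Sort key that compares runs of digits as numbers, not as text."""
    parts = re.split(r"(\d+)", filename.lower())
    return [int(part) if part.isdigit() else part for part in parts]


# Hypothetical file names for a scan saved one page at a time.
scans = ["letter_page10.pdf", "letter_page2.pdf", "letter_page1.pdf"]
ordered = sorted(scans, key=page_order_key)
print(ordered)
```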

2. Upload your scanned file. Browse >> Upload

3. Convert your scanned file to MS Word .doc (no .docx option). >> Process


4. Retrieve your converted file at the link provided.
Here’s what my converted snippet looks like:

That’s it! As you can see, the Word file is littered with debris and some ugly bits, but you’re well on your way to having an editable, searchable file, suitable for formatting as an ebook. So go ahead—open your drawer...

Next: File cleanup.

Photo by Carla Douglas