Why is proofreading necessary?
This book has been scanned, and OCR (Optical Character Recognition) has been used to convert the picture
of each page into text. OCR is over 99% accurate - unfortunately this is not good enough. With approximately
65 characters per line, one mistake in 100 characters is an error in roughly every second line - or about 20 errors per page.
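The arithmetic above can be checked with a short sketch. The 65 characters per line and the 1-in-100 error rate come from the text. The 31 lines per page is an assumption made for illustration only, since the page length is not stated.

```python
CHARS_PER_LINE = 65      # from the text
ERROR_RATE = 1 / 100     # "one mistake in 100 characters"
LINES_PER_PAGE = 31      # assumption: the page length is not stated in the text

errors_per_line = CHARS_PER_LINE * ERROR_RATE
lines_per_error = 1 / errors_per_line
errors_per_page = errors_per_line * LINES_PER_PAGE

print(f"errors per line: {errors_per_line:.2f}")
print(f"one error every {lines_per_error:.1f} lines")
print(f"errors per page: {errors_per_page:.0f}")
```

Under these assumptions the result is about one error every line and a half, and about 20 errors on a page.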
What sort of errors do you find?
Missing letters or words - particularly at the ends of lines, where the book does not lie flat on the scanner.
Well, it might be
so, Mr. Tatham hoped so--but the father, Tatham knew personally
--a man of the worst character, a wine-bibber and an idler
taverns and billiard-rooms, and a notorious insolvent. 'I can
The word "in" is missing completely (it should read "an idler in taverns").
gentleman in a shabby braide~ frock
The OCR has not been able to work out what the letter is - and has put in a ~ instead of d.
Wrong letters - these can fool a spell checker, because the result is often still a valid word.
'Can I have the honour of speaking with Major Pendennis in
private I'
The OCR has substituted "I" for "?" - a common error. Also "had" and "he" become "bad" and "be".
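Some of these errors leave a visible trace that a simple search can find. The following is a minimal sketch, not part of the actual proofing process: the pattern list and the function name flag_suspects are hypothetical, and the patterns only cover the ~ marker and the "I" for "?" case shown above. Substitutions such as "bad" for "had" produce valid words and still need a human reader.

```python
import re

# Hypothetical patterns for illustration; each is a guess at a common OCR slip.
SUSPECT_PATTERNS = [
    (re.compile(r"~"), "unrecognised letter marked with ~"),
    (re.compile(r"\sI'(?=\s|$)"), "possible '?' read as 'I' before a closing quote"),
]

def flag_suspects(lines):
    """Return (line_number, reason, line) tuples for lines worth a second look."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, reason in SUSPECT_PATTERNS:
            if pattern.search(line):
                hits.append((number, reason, line))
    return hits

sample = [
    "gentleman in a shabby braide~ frock",
    "'Can I have the honour of speaking with Major Pendennis in",
    "private I'",
]
for number, reason, line in flag_suspects(sample):
    print(f"{number}: {reason}: {line}")
```

Run on the two examples above, this flags the first and third sample lines and leaves the second alone.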
There are a lot of dashes in the text. Is this right?
Dashes are used in several ways in the printed version of the book:
To join words broken at line ends.
In spellings where we would no longer use a dash.
As long dashes, to mark a pause in the flow of text.
To make an expletive such as d--n acceptable.
I intend to tidy these up in one pass - it is quicker to do it once and consistently. Don't worry too much about them.
What about spaces?
HTML collapses extra spaces when the page is displayed, and Gutenberg has its own rules. It is better to fix them once at the end using search and replace rather
than hunt down every place where an extra space hides.
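A search-and-replace pass of this kind might look like the sketch below. It is only an illustration: the function name tidy_spaces is hypothetical, and it applies a generic clean-up (collapse runs of spaces, trim line ends) rather than Gutenberg's actual rules, which are not set out here. It also shortens leading indentation to a single space, so it would need adjusting for indented verse or quotations.

```python
import re

def tidy_spaces(text):
    """Collapse runs of two or more spaces or tabs, then trim trailing spaces."""
    # Replace any run of two or more spaces/tabs with a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Remove spaces or tabs left at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text

print(tidy_spaces("a wine-bibber  and   an idler  \nin taverns "))
```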
Why are you using HTML when Gutenberg uses ASCII text?
The book uses italics and has illustrations by the author, which need HTML. An ASCII version will be prepared at the end
by stripping out the HTML code.
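Stripping the HTML could be done along the following lines. This is a sketch using Python's standard html.parser module, not the tool actually used for this book. The class name TextOnly, the function name strip_html, and the choice to mark italics with underscores are all assumptions made for illustration. Images are simply dropped, and any non-ASCII characters would still need separate handling.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text content of a page and drop the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: italics are marked with underscores in the plain text.
        if tag in ("i", "em"):
            self.parts.append("_")

    def handle_endtag(self, tag):
        if tag in ("i", "em"):
            self.parts.append("_")

    def handle_data(self, data):
        # Text between tags is kept unchanged.
        self.parts.append(data)

def strip_html(html_text):
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

print(strip_html("<p>He was <i>very</i> angry.</p>"))
```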