How do you encode your paper scans?

Atemu@lemmy.ml · 1 year ago

How do you encode your paper scans?

Saigonauticon@voltage.vn · 1 year ago

I use JPEGs in a PDF. They can be mediocre quality. Using an OK scanner makes a big difference. It’s good enough!

I’m required by law to keep physical paper copies for 35 years. So my parallel solution is a cursed filing cabinet, and several crates that describe the content of the filing cabinet. It’s ugly, but saves me tons on data archiving, I guess?

loug@lemmy.ca · 1 year ago

Paperless has a tracking method for paper copies as well. I think the idea is that you assign each document an archive number, then file the paper in the expected place. For example, if you get 500 docs a year, they would be numbered 2023-01 to 2023-500, and you put them in the filing cabinet in order from 1 to 500 under 2023. Then you can still search for a document by name, tag, correspondent, etc. in Paperless and find the archive number.
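
To make the numbering idea concrete, here is a minimal sketch of a per-year label scheme like the one described. It is only an illustration of the filing convention: the function names and the zero-padding width are assumptions of mine, and nothing here talks to Paperless itself.

```python
def archive_label(year: int, seq: int, width: int = 3) -> str:
    """Format a per-year archive label such as '2023-042'.

    The zero-padding width is an assumption; it just keeps labels
    sortable as plain strings within a year.
    """
    if seq < 1:
        raise ValueError("sequence numbers start at 1")
    return f"{year}-{seq:0{width}d}"


def next_label(existing: list[str], year: int) -> str:
    """Return the next free label for `year`, given labels already used."""
    prefix = f"{year}-"
    # Collect the sequence numbers already taken in this year.
    used = [int(label[len(prefix):]) for label in existing
            if label.startswith(prefix)]
    return archive_label(year, max(used, default=0) + 1)


labels = ["2022-500", "2023-001", "2023-002"]
print(next_label(labels, 2023))
```

You would write the printed label on the paper copy, store the same value with the scan, and file the paper in sequence under its year.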

Atemu@lemmy.ml · edit-2 1 year ago

Using an OK scanner makes a big difference.

WDYM? The lossless scans SANE produces themselves subjectively look very good. My only issue is the transcoding to lossy formats I want to do in order to save >3/4 of the space.

Saigonauticon@voltage.vn · 1 year ago

Oh, it’s common in my country to use a smartphone to ‘scan’ documents by actually just taking a lousy photo of them. It’s so prevalent that when you tell someone to do a scan they usually do this instead.

I bought a cheap Canon scanner for $50 and it’s pretty perfect for legal documents. A little slow maybe. I use SANE, then do lossy compression too.

In rare situations I then post-process the PDF to even worse quality using Ghostscript, for example when a foreign visa application form requires a scan of a really long document but doesn’t accept files over 2MB.
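
For anyone wanting to try that kind of shrinking, this is a rough sketch of how a Ghostscript recompression might be driven from Python. The comment does not say which options were used, so the `/screen` preset and the helper names are assumptions; the flags themselves are standard `pdfwrite` options. Whether the result lands under a given size limit depends entirely on the document.

```python
import subprocess

# Quality presets understood by Ghostscript's pdfwrite device,
# roughly from smallest/worst to largest/best.
PRESETS = {"/screen", "/ebook", "/printer", "/prepress", "/default"}


def gs_shrink_command(src: str, dst: str, preset: str = "/screen") -> list[str]:
    """Build a Ghostscript command line that re-encodes `src` into `dst`."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",        # write a new PDF
        f"-dPDFSETTINGS={preset}",  # downsample/recompress images
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def shrink_pdf(src: str, dst: str, preset: str = "/screen") -> None:
    """Run Ghostscript (must be installed as `gs`); raises on failure."""
    subprocess.run(gs_shrink_command(src, dst, preset), check=True)


print(" ".join(gs_shrink_command("scan.pdf", "scan-small.pdf")))
```

If `/screen` is too ugly to read, `/ebook` is the next step up in quality and size.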

Atemu@lemmy.ml · 1 year ago

I use SANE, then do lossy compression too.

Well, what kind of lossy compression? JPEG?

IME, JPEG looks quite terrible for text documents, even at q=95.

dpflug@hachyderm.io · 1 year ago

@Atemu
I just use grayscale PNGs, myself. optipng usually takes them down to a decent size.
@Saigonauticon

Atemu@lemmy.ml · 1 year ago

Hmm, I’m using grayscale PNGs as my baseline here. A 150dpi scan is about 1.3MiB.

A WebP of similar quality (for the purposes of text documents) is about 1/4 of that.
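
For a rough sense of scale, the figures above (about 1.3MiB per 150dpi grayscale PNG page, WebP at roughly a quarter of that) can be turned into a back-of-the-envelope estimate. The page count below is a made-up example, and real sizes will vary with page content.

```python
PNG_MIB_PER_PAGE = 1.3  # per-page figure quoted above for a 150dpi grayscale PNG
WEBP_RATIO = 0.25       # "about 1/4 of that"


def archive_size_mib(pages: int,
                     mib_per_page: float = PNG_MIB_PER_PAGE,
                     ratio: float = 1.0) -> float:
    """Estimate total size in MiB for `pages` scanned pages."""
    return pages * mib_per_page * ratio


pages = 1000  # hypothetical archive size, not a number from the thread
png = archive_size_mib(pages)
webp = archive_size_mib(pages, ratio=WEBP_RATIO)
print(f"PNG:  {png:.0f} MiB")
print(f"WebP: {webp:.0f} MiB")
print(f"Saved: {png - webp:.0f} MiB")
```

At a thousand pages that is roughly 1.3GiB of PNGs against about a third of a GiB of WebPs, so the gap only starts to matter for fairly large archives.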

dpflug@hachyderm.io · 1 year ago

@Atemu
WebP is much better, as long as your target reader(s) support it.

Atemu@lemmy.ml · 1 year ago

Yes, as I said.

As also mentioned in the post, I need a solution for multiple pages. An image (no matter the format) only represents a single page, and WebPs don’t go into PDFs.

kyle@infosec.pub · edit-2 1 year ago

I’ve never used paperless but just checked it out and it looks pretty neat. My first thought would be to scan documents in a higher resolution, let the OCR happen, then convert the file to a JPEG or something smaller after you’ve extracted the text.

I spent a few minutes looking at their wiki and it looks like it might be possible.

Like I said though, no experience with this software so I’m not sure that’d actually work.

Atemu@lemmy.ml · 1 year ago

Interesting idea, but I think I’d like to retain close-to-original quality in case I want to redo OCR if/when Paperless’ OCR improves in the future.

surewhynotlem@lemmy.world · 9 months ago

By ‘paperless’, y’all mean this one? https://docs.paperless-ngx.com/

Atemu@lemmy.ml · 9 months ago

Correct. That’s the currently maintained paperless project.

surewhynotlem@lemmy.world · 9 months ago

Thanks! There’s a very interesting trail of dead projects to follow. But I got ngx working and it’s great so far.

Atemu@lemmy.ml · 9 months ago

I for one am still waiting for paperless-ngnxn2-next-3.0_hypr.

lemming007@lemm.ee · 1 year ago

PDF/A

Atemu@lemmy.ml · 1 year ago

And how do you encode the images of the scan contained in the PDF/A? That’s the crux here.

lemming007@lemm.ee · 1 year ago

I’m not sure I understand. I just scan anything and let my software spit out PDF/A.

Atemu@lemmy.ml · 1 year ago

PDF/A is not an image format. As a document, it may contain images.

lemming007@lemm.ee · edit-2 1 year ago

My PDF/A documents contain all kinds of content, including text and images. To me, it doesn’t matter what format the embedded images are encoded in, as long as I can open them 20 years from now. Why would one care one way or another?

Atemu@lemmy.ml · 1 year ago

I care that the text remains readable (both to me and also software) and that I don’t balloon my storage out of control.

JPEG (even at higher quality levels) subjectively degrades text in particular, to a degree that makes me worry about the former, and PNG makes me worry about the latter.

My current plan is to go with PNG, since storage is a relatively cheap issue to fix while data loss is pretty much permanent.