Paperless office; document/image processing

Sort

evenwicht @lemmy.sdf.org 10 hr. ago

(PDF neutering) Not all PDFs are documents; some are apps! Insurance company sent me a form to sign as a PDF with ~~JavaScript~~ Java. Is it a tracker?

They emailed me a PDF. It opened fine with evince and looked like a simple doc at first. Then I clicked on a field in the form. Strangely, instead of simply populating the field with my text, a PDF note window popped up so my text entry went into a PDF note, which many viewers present as a sticky note icon.

If I were to fax this PDF, the PDF comments would just get lost. So to fill out the form I fed it to LaTeX and used the overpic pkg to write text wherever I choose. LaTeX rejected the file.. could not handle this PDF. Then I used the file command to see what I am dealing with: $ file signature_page.pdf signature_page.pdf: Java serialization data, version 5 WTF is that? I know PDF supports JavaScript (shitty indeed). Is that what this is? “Java” is not JavaScript, so I’m baffled. Why is java in a PDF? (edit: explainer on java serialization, and some analysis)

My workaround was to use evince to print the PDF to PDF (using a PDF-building printer driver or whatever evince uses), then feed that into LaTeX. That worked.

My question is, how common is this? Is it going to become a mechanism to embed a tracking pixel like corporate assholes do with HTML email?

I probably need to change my habits. I know PDF docs can serve as carriers of copious malware anyway. Some people go to the extreme of creating a one-time use virtual machine with PDF viewer which then prints a PDF to a PDF before destroying the VM which is assumed to be compromised.

My temptation is to take a less tedious approach. E.g. something like: $ firejail --net=none evince untrusted.pdf I should be able to improve on that by doing something non-interactive. My first guess: $ firejail --net=none gs -sDEVICE=pdfwrite -q -dFIXEDMEDIA -dSCALE=1 -o is_this_output_safe.pdf -- /usr/share/ghostscript/*/lib/viewpbm.ps untrusted_input.pdf output: Error: /invalidfileaccess in --file-- Operand stack: (untrusted_input.pdf) (r) Execution stack: %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1990 1 3 %oparray_pop 1989 1 3 %oparray_pop 1977 1 3 %oparray_pop 1833 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- Dictionary stack: --dict:769/1123(ro)(G)-- --dict:0/20(G)-- --dict:87/200(L)-- --dict:0/20(L)-- Current allocation mode is local Last OS error: Permission denied Current file position is 10479 GPL Ghostscript 10.00.0: Unrecoverable error, exit code 1 What’s my problem? Better ideas? I would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.

(note: I also wonder what happens when Firefox opens this PDF, because Mozilla is happy to blindly execute whatever code it receives no matter the context.)

1
plantteacher @mander.xyz 1 mo. ago

How to obtain the density (DPI / PPI) of a PGM file -- anyone know? ImageMagick does not cut it.

Running this gives the geometry but not the density: $ identify -verbose myfile.pgm | grep -iE 'geometry|pixel|dens|size|dimen|inch|unit' There is also a “Pixels per second” attribute which means nothing to me. No density and not even a canvas/page dimension (which would make it possible to compute the density). The “Units” attribute on my source images are “undefined”.

Suggestions?

4
ulo @discuss.tchncs.de 6 mo. ago

Safe enough for public webserver?

I just discovered this software and like it very much.

Would you consider it safe enough to use it with my personal documents on a public webserver?

0
freedomPusher @sopuli.xyz 8 mo. ago

PDF renders radically different between Adobe Acrobat® vs. evince & okular (GhostScript-based)

mirror.ctan.org /macros/latex/contrib/pdfcomment/doc/example.pdf

The linked doc is a PDF which looks very different in Adobe Acrobat than it does in evince and okular, which I believe are both based on the same GhostScript library.

So the question is, is there an alternative free PDF viewer that does not rely on the GhostScript library for rendering?

#AskFedi

0
freedomPusher @sopuli.xyz 9 mo. ago

[solved] TIFF → DjVu conversion produces bigger file from bilevel doc than color

I would like to get to the bottom of what I am doing wrong that leads to black and white documents having a bigger filesize than color.

My process for a color TIFF is like this:

① tiff2pdf ② ocrmypdf ③ pdf2djvu

Resulting color DjVu file is ~56k. When pdfimages -all runs on the intermediate PDF file, it shows CCITT (fax) is inside.

My process for a black and white TIFF is the same:

① tiff2pdf ② ocrmypdf ③ pdf2djvu

Resulting black and white DjVu file is ~145k (almost 3× the color size). When pdfimages -all runs on the intermediate PDF file, it shows a PNG file is inside. If I replace step ① with ImageMagick’s convert, the first PDF is 10mb, but in the end the resulting djvu file is still ~145k. And PNG is still inside the intermediate PDF.

I can get the bitonal (bilevel) image smaller by using cjb2 -clean, which goes straight from TIFF to DjVu, but then I can’t OCR it due to the lack of PDF intermediate version. And the size is still bigger than the color doc (~68k).

update ---

I think I found the problem, which would not be evident from what I posted. I was passing the --force-ocr option to ocrmypdf. I did that just to push through errors like “this doc is already OCRd”. But that option does much more than you would expect: it transcodes the doc. Looks like my fix is to pass --redo-ocr instead. It’s not yet obvious to me why --force-ocr impacted bilevel images more.

#askFedi

0

7 Active users