many more formatting fixes

Uwe Steinmann 2025-10-23 13:37:52 +02:00
parent 7e2803da25
commit 00e6a22dbd

@ -1,5 +1,4 @@
Commands for converting documents # Commands for converting documents
----------------------------------
This file contains commands for converting different document types This file contains commands for converting different document types
into into
@ -26,13 +25,14 @@ UTF-8 chars. In such a case you may want to set `clear_env=no` in
php-fpm's configuration. On Debian this is done in the file php-fpm's configuration. On Debian this is done in the file
`/etc/php/<php version>/fpm/pool.d/www.conf`. Search for `clear_env`. `/etc/php/<php version>/fpm/pool.d/www.conf`. Search for `clear_env`.
Conversion to text for fulltext search ## Conversion to text for fulltext search
=======================================
### text/plain, text/csv, application/csv
* text/plain, text/csv, application/csv
`cat '%s'` `cat '%s'`
* application/pdf ### application/pdf
`pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'` `pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'`
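To see what the `sed` filter in the pdftotext command does, the following sketch applies the same two expressions to a sample line. The function name `clean_text` and the sample input are made up for illustration; only the two `sed` expressions are taken from the command above.

```shell
# clean_text is a hypothetical helper; the two sed expressions are copied
# from the pdftotext command above.
clean_text() {
  # 1st expression: replace a single character surrounded by spaces with one space
  # 2nd expression: delete all digits and dots
  printf '%s\n' "$1" | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'
}

clean_text 'Invoice 2024 total x 42.50 EUR'
```

Deleting the digits leaves the surrounding spaces in place, so the result may contain runs of blanks. For a fulltext index this is normally harmless.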
If pdftotext takes too long on large documents you may want to pass parameter If pdftotext takes too long on large documents you may want to pass parameter
@ -41,87 +41,93 @@ Conversion to text for fulltext search
`mutool draw -F txt -q -N -o - %s ` `mutool draw -F txt -q -N -o - %s `
* application/vnd.openxmlformats-officedocument.wordprocessingml.document ### application/vnd.openxmlformats-officedocument.wordprocessingml.document
`docx2txt '%s' -` `docx2txt '%s' -`
* application/msword ### application/msword
`catdoc %s` `catdoc %s`
* application/vnd.oasis.opendocument.text ### application/vnd.oasis.opendocument.text
`odt2txt %s` `odt2txt %s`
* application/vnd.openxmlformats-officedocument.spreadsheetml.sheet ### application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
`xlsx2csv -d tab %s` `xlsx2csv -d tab %s`
* application/vnd.ms-excel ### application/vnd.ms-excel
`xls2csv -d tab %s` `xls2csv -d tab %s`
* text/html ### text/html
`html2text %s` `html2text %s`
Many office formats Many office formats can be converted with `unoconv`, though in the past it
unoconv -d document -f txt --stdout '%s' has sometimes crashed or taken a long time.
`unoconv -d document -f txt --stdout '%s'`
Apache Tika is another option for creating plain text from various document Apache Tika is another option for creating plain text from various document
types. Just use curl to send the document to your tika server and get the types. Just use `curl` to send the document to your tika server and get the
plain text in return. plain text in return.
curl -s -T '%s' http://localhost:9998/tika --header 'Accept: text/plain' `curl -s -T '%s' http://localhost:9998/tika --header 'Accept: text/plain'`
Conversion to pdf for pdf preview Of course, this requires installing Apache Tika first when using the docker
================================== image.
text/plain ## Conversion to pdf for pdf preview
text/csv
application/csv
application/vnd.oasis.opendocument.text
application/msword
application/vnd.wordperfect
text/rtf
unoconv -d document -f pdf --stdout -v '%f' > '%o'
image/png * text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text, application/msword, application/vnd.wordperfect, text/rtf
image/jpg
image/jpeg
convert -density 300 '%f' 'pdf:%o'
image/svg+xml `unoconv -d document -f pdf --stdout -v '%f' > '%o'`
cairosvg -f pdf -o '%o' '%f'
application/vnd.ms-powerpoint * image/png, image/jpg, image/jpeg
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.oasis.opendocument.presentation
unoconv -d presentation -f pdf --stdout -v '%f' > '%o'
application/vnd.ms-excel `convert -density 300 '%f' 'pdf:%o'`
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.oasis.opendocument.spreadsheet
unoconv -d spreadsheet -f pdf --stdout -v '%f' > '%o'
message/rfc822 Actually `convert` can be used for many other image formats.
java -jar emailconverter-2.5.3-all.jar '%f' -o '%o'
* image/svg+xml
`cairosvg -f pdf -o '%o' '%f'`
* application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.oasis.opendocument.presentation
`unoconv -d presentation -f pdf --stdout -v '%f' > '%o'`
* application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.oasis.opendocument.spreadsheet
`unoconv -d spreadsheet -f pdf --stdout -v '%f' > '%o'`
* message/rfc822
`java -jar emailconverter-2.5.3-all.jar '%f' -o '%o'`
The emailconverter can be obtained from https://github.com/nickrussler/email-to-pdf-converter The emailconverter can be obtained from https://github.com/nickrussler/email-to-pdf-converter
It requires wkhtmltopdf which is part of Debian. It requires `wkhtmltopdf` which is part of Debian.
text/plain * text/plain
iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | ps2pdf - -
`iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | ps2pdf - -`
The parameter `-q` is important because a2ps sends some statistical The parameter `-q` is important because a2ps sends some statistical
data to stderr, which makes SeedDMS believe the command has failed. data to stderr, which makes SeedDMS believe the command has failed.
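The effect of stray stderr output can be tried out with a small wrapper. This is only a sketch: the helper name `succeeds_quietly` is made up, and the rule that any output on stderr counts as a failure is an assumption based on the note above, not a description of SeedDMS's actual code.

```shell
# succeeds_quietly is a hypothetical helper: it reports success only if the
# command exits with status 0 AND writes nothing to stderr.
succeeds_quietly() {
  err_file=$(mktemp) || return 2
  status=0
  # discard stdout, capture stderr in a temporary file
  "$@" >/dev/null 2>"$err_file" || status=$?
  # any byte on stderr is treated as a failure (assumption, see above)
  if [ -s "$err_file" ]; then status=1; fi
  rm -f "$err_file"
  return "$status"
}

succeeds_quietly sh -c 'echo converted' && echo "accepted"
succeeds_quietly sh -c 'echo "[Total: 1 page]" >&2' || echo "rejected: output on stderr"
```

Running a conversion command through such a wrapper on the command line is a quick way to check whether it stays silent on stderr before it is entered in the configuration.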
application/x-xopp * application/x-xopp
xournalpp -p "%o" "%f" `xournalpp -p "%o" "%f"`
Converting from application/x-xopp to pdf only works if the xopp file Converting from application/x-xopp to pdf only works if the xopp file
does not use a pdf document as a background, because this pdf is not does not use a pdf document as a background, because this pdf is not
stored in the xopp file. stored in the xopp file.
Conversion to png for preview images ## Conversion to png for preview images
=====================================
If you have problems running convert on PDF documents then read this page If you have problems running convert on PDF documents then read the page
https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert
It basically instructs you to comment out the line It basically instructs you to comment out the line
@ -129,75 +135,71 @@ It basically instructs you to comment out the line
in /etc/ImageMagick-6/policy.xml in /etc/ImageMagick-6/policy.xml
convert determines the format of the converted image from the extension of `convert` determines the format of the converted image from the extension of
the output filename. SeedDMS usually sets a proper extension when running the output filename. SeedDMS usually sets a proper extension when running
the command, but nevertheless it is good practice to explicitly set the output the command, but nevertheless it is good practice to explicitly set the output
format by prefixing the output filename with 'png:'. This is of course always format by prefixing the output filename with 'png:'. This is of course always
needed if the output goes to stdout. needed if the output goes to stdout.
image/jpg ### image/jpg, image/jpeg, image/png
image/jpeg
image/png
convert -resize %wx '%f' 'png:%o'
image/svg+xml `convert -resize %wx '%f' 'png:%o'`
cairosvg -f png --output-width %w -o '%o' '%f'
text/plain * image/svg+xml
convert -density 100 -resize %wx 'text:%f[0]' 'png:%o'
application/pdf `cairosvg -f png --output-width %w -o '%o' '%f'`
gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q '%f' | convert -resize %wx png:- '%o'
convert -density 100 -resize %wx '%f[0]' 'png:%o' * text/plain
mutool draw -F png -w %w -q -N -o '%o' '%f' 1 `convert -density 100 -resize %wx 'text:%f[0]' 'png:%o'`
pdftocairo '%f' -png -singlefile -scale-to-x %w -scale-to-y -1 - > '%o' * application/pdf
pdftocairo needs to output to stdout because the output file name passed `gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q '%f' | convert -resize %wx png:- '%o'`
to pdftocairo will be suffixed with png
application/postscript `convert -density 100 -resize %wx '%f[0]' 'png:%o'`
convert -density 100 -resize %wx '%f[0]' 'png:%o'
`mutool draw -F png -w %w -q -N -o '%o' '%f' 1`
`pdftocairo '%f' -png -singlefile -scale-to-x %w -scale-to-y -1 - > '%o'`
`pdftocairo` needs to output to stdout because the output file name passed
to pdftocairo will be suffixed with `.png`
* application/postscript
`convert -density 100 -resize %wx '%f[0]' 'png:%o'`
* text/plain
text/plain
iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -dPDFFitPage -r72x72 -sOutputFile=- -q - | convert -resize %wx png:- 'png:%o' iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -dPDFFitPage -r72x72 -sOutputFile=- -q - | convert -resize %wx png:- 'png:%o'
On Linux systems you will have to set the desired value in /etc/papersize for a2ps On Linux systems you will have to set the desired value in /etc/papersize for a2ps
e.g. a4, or letter. Unfortunately, a2ps cannot process utf-8 encoded files. That's e.g. a4, or letter. Unfortunately, a2ps cannot process utf-8 encoded files. That's
why the input needs to be recoded with iconv or recode. why the input needs to be recoded with iconv or recode.
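The recoding step can be checked in isolation. The sketch below wraps the same `iconv` call in a hypothetical function `to_latin1` and counts the bytes before they would reach a2ps; the sample word is only an illustration.

```shell
# to_latin1 is a hypothetical wrapper around the iconv call used above.
to_latin1() {
  # -c drops characters that have no latin1 equivalent
  iconv -c -f utf-8 -t latin1
}

# "Gruesse" with u-umlaut and sharp s, written as UTF-8 bytes (7 bytes);
# the octal escapes keep this script ASCII-only
printf 'Gr\303\274\303\237e' | to_latin1 | wc -c
```

Each of the two non-ASCII characters shrinks from two bytes to one, so the byte count drops from 7 to 5. Characters outside latin1 are lost, which is the price for being able to use a2ps.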
application/msword * application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect
application/vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.text `unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'`
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.ms-excel * video/webm, video/mp4
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/rtf
application/vnd.ms-powerpoint
text/csv
application/csv
application/vnd.wordperfect
unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'
video/webm
video/mp4
This takes the 12th frame of a video and converts it into a png. It requires This takes the 12th frame of a video and converts it into a png. It requires
ffmpeg to be installed. ffmpeg to be installed.
convert -resize %wx "%f[12]" "png:%o" `convert -resize %wx "%f[12]" "png:%o"`
You may as well use ffmpeg right away You may as well use ffmpeg right away
ffmpeg -i "%f" -ss 00:00:02 -frames:v 1 -loglevel quiet -vf scale=%w:-1 -f apng "%o" `ffmpeg -i "%f" -ss 00:00:02 -frames:v 1 -loglevel quiet -vf scale=%w:-1 -f apng "%o"`
audio/mpeg * audio/mpeg
sox "%f" -n spectrogram -x 600 -Y 550 -r -l -o - | convert -resize %wx png:- "png:%o" `sox "%f" -n spectrogram -x 600 -Y 550 -r -l -o - | convert -resize %wx png:- "png:%o"`
application/x-xopp * application/x-xopp
xournalpp -i "%o" --export-png-width=%w "%f"
`xournalpp -i "%o" --export-png-width=%w "%f"`
Converting from application/x-xopp to png only works if the xopp file Converting from application/x-xopp to png only works if the xopp file
does not use a pdf document as a background, because this pdf is not does not use a pdf document as a background, because this pdf is not