
Rasterize everything in pdf except text

Category: System (Linux) | Posted by: 店小二03 | Date: 2016 | Author: 红领巾

I had an issue including a PDF with transparency as a subfigure in another PDF. This led me down a dark path of trying to rasterize everything in a PDF except for the text. I tried rasterizing everything and just running OCR on top of the text, but selecting OCR-ized text behaves oddly and the text recognition wasn’t perfect. Not to mention that it would have been a really roundabout way to solve this.

Here’s the insane pipeline I settled on:

1. Open the PDF in Illustrator and save as input.svg; under options, choose “use system fonts”.
2. Run ./rasterize-everything-but-text.sh input.svg output.svg (see below).
3. Open output.svg in Illustrator and save as raster-but-text.pdf.

The bash script ./rasterize-everything-but-text.sh is itself an absurd, likely very fragile text manipulation and rasterization of the .svg files:

    #!/bin/bash
    #
    # Usage:
    #
    #   rasterize-everything-but-text.sh input.svg output.svg
    #
    input="$1"
    output="$2"
    # suck out header from svg file
    header=`dos2unix < "$input" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\).*/\1/' | tr '\00' '\n'`
    # grab all text tags
    text=`cat "$input" | grep "<text.*"`
    # create svg file without text tags
    notextsvg="no-text.svg"
    notextpng="no-text.png"
    cat "$input" | grep -v "<text.*" > "$notextsvg"
    # convert to png
    rsvg-convert -h 1000 "$notextsvg" > "$notextpng"
    # convert back to svg (containing just <image> tag)
    rastersvg="raster.svg"
    convert "$notextpng" "$rastersvg"
    # extract body (image tag)
    body=`dos2unix < "$rastersvg" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\)\(.*\)<\/svg>/\2/' | tr '\00' '\n'`
    # piece together original header, image tag, and text
    echo "$header $body $text </svg>" > "$output"
    # Fix image tag to have same size as document
    dim=`echo "$header" | grep -o 'width=".*" height="[^"]*"' | tr '"' "'"`
    # note: `sed -i ''` is the BSD/macOS form; GNU sed takes a plain `-i`
    sed -i '' "s/\(image id=\"image0\" \)width=\".*\" height=\"[^\"]*\"/\1$dim/" "$output"
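To make the core idea of the script easier to see, here is a minimal sketch of just the line-based split and the size lookup, run on a tiny hand-written SVG. The sample file, its contents, and the variable names are hypothetical and not taken from the original post. It assumes every `<text>` element sits on its own line and that the `<svg>` tag is the first line, which is exactly the kind of assumption that makes the real script fragile. It also uses `[^"]*` instead of `.*` in the `width` pattern, a small tightening that is my own choice rather than the author's.

```shell
#!/bin/bash
# Hypothetical demo: split a small SVG into <text> lines and everything else.
set -eu

workdir=$(mktemp -d)
sample="$workdir/sample.svg"

# A made-up SVG standing in for an Illustrator export.
cat > "$sample" <<'EOF'
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
<rect x="0" y="0" width="200" height="100" fill="#eee"/>
<text x="10" y="20">Figure 1</text>
<circle cx="50" cy="50" r="20" fill-opacity="0.5"/>
<text x="10" y="90">x axis</text>
</svg>
EOF

# Lines holding <text> elements: these stay as real, selectable text.
text_lines=$(grep '<text' "$sample")

# Everything else: this is the part that would be rasterized to a PNG.
grep -v '<text' "$sample" > "$workdir/no-text.svg"

# Document size, read from the opening <svg> tag only.
dim=$(head -n 1 "$sample" | grep -o 'width="[^"]*" height="[^"]*"')

text_count=$(printf '%s\n' "$text_lines" | grep -c '<text')
rest_count=$(grep -c '' "$workdir/no-text.svg")

echo "text lines:     $text_count"
echo "non-text lines: $rest_count"
echo "document size:  $dim"
```

If an export ever places a `<text>` element across several lines, or puts two elements on one line, this split would silently produce broken output, so it is worth inspecting the intermediate no-text.svg before rasterizing.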

Tags: bash, convert, dos2unix, pdf, rsvg-convert, sed, svg, tr

This entry was posted on Wednesday, October 19th, 2016 at 12:36 am and is filed under code.
