Global Blind Spot: June 2012

I finally got around to trying a rudimentary PDF to LaTeX conversion in Linux.

"It's like turning a hamburger into a cow" :-)

Usage:

./pdftolatex.sh "filename.pdf"

Example output:

# pdftolatex.sh test.pdf

PDF to LaTeX conversion script
Copyleft 2012 (c) Tom Van Braeckel <tomvanbraeckel@gmail.com>

WARNING: this is a rudimentary first stab. Proceed with caution.

Checking if all dependencies are found...
/usr/bin/pdftohtml
Dependency pdftohtml found.
/usr/bin/gnuhtml2latex
Dependency gnuhtml2latex found.
/usr/bin/pdflatex
Dependency pdflatex found.

Converting test.pdf to test.pdfs.html
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11

Fixing up test.pdfs.html to test.pdf_fixedup.html...

Readying test.pdf_fixedup.html for tex conversion in test.pdf_fixedup_ready_for_tex_conversion.html

Converting test.pdf_fixedup_ready_for_tex_conversion.html to test.pdf_frompdf.tex
The resulting file is in test.pdf_frompdf.tex

Fixing up test.pdf_frompdf.tex to test.pdf_frompdf_fixedup.tex

Converting test.pdf_frompdf_fixedup.tex to test.pdf_frompdf_fixedup.pdf for inspection...

Opening test.pdf_frompdf_fixedup.pdf with Evince - you can try another PDF viewer if you like...
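
The dependency check at the top of this output prints each tool's path via which. A minimal sketch of the same check using the POSIX built-in command -v instead, assuming you would rather not rely on which being installed (the function name check_dependencies is hypothetical and not part of the script):

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Returns 0 if every named command is on PATH, otherwise reports
# the first missing one on stderr and returns 1.
check_dependencies() {
    for dependency in "$@"; do
        if ! command -v "$dependency" >/dev/null 2>&1; then
            echo "Dependency $dependency not found." >&2
            return 1
        fi
    done
    return 0
}
```

It could be called as check_dependencies pdftohtml gnuhtml2latex pdflatex || exit 1 before the conversion starts.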

Script source code:

#!/bin/sh

file="$1"
dependencies="pdftohtml gnuhtml2latex pdflatex"

echo "PDF to LaTeX conversion script"
echo "Copyleft 2012 (c) Tom Van Braeckel <tomvanbraeckel@gmail.com>"
echo
echo "WARNING: this is a rudimentary first stab. Proceed with caution."
echo

if [ -z "$file" ]; then
    echo "Usage: $0 <pdf file>"
    echo "The resulting .tex file will be written to <pdf file>_frompdf_fixedup.tex"
    exit 1
fi

echo

echo "Checking if all dependencies are found..."
for dependency in $dependencies; do
    # which prints the path of the tool when it is found
    if ! which "$dependency"; then
        echo "Dependency $dependency not found, install it using:"
        echo "sudo apt-get install $dependency"
        exit 1
    else
        echo "Dependency $dependency found."
    fi
done

echo
echo "Converting $file to ${file}s.html"
pdftohtml -nomerge "$file" "$file".html

echo
echo "Fixing up ${file}s.html to ${file}_fixedup.html..."
# This nasty br in a b causes problems later on
sed "s, , ,g" "${file}s.html" > "${file}_fixedup.html"
# ending with bold text means it is the end of a title and can be on a new line
sed -i "s, \$, ,g" "${file}_fixedup.html"
# starting with bold text means it is the start of a title so can be on a new line
sed -i "s,^, ,g" "${file}_fixedup.html"
# spaces ?
sed -i "s,\ , ,g" "${file}_fixedup.html"
# there is no use in a space before a newline, and it causes a bogus indent when converting to .tex later on
sed -i "s, , ,g" "${file}_fixedup.html"
# encoding, although gnuhtml2latex ignores this
sed -i "s,<HEAD>,<HEAD><meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />," "${file}_fixedup.html"

echo
echo "Readying ${file}_fixedup.html for tex conversion in ${file}_fixedup_ready_for_tex_conversion.html"
cp "${file}_fixedup.html" "${file}_fixedup_ready_for_tex_conversion.html"
# br is ignored and should be replaced by a newline
sed -i "s, ,,g" "${file}_fixedup_ready_for_tex_conversion.html"
# Remove bogus links - this fixes the empty } problem
sed -i "s/<A name=[0-9]\+><\/a>//g" "${file}_fixedup_ready_for_tex_conversion.html"

echo
echo "Converting ${file}_fixedup_ready_for_tex_conversion.html to ${file}_frompdf.tex"
# -c = table of contents
# -s = write to standard out
# -p = break page after title / table of contents
# -H = use hyperref package to process anchors
# -g = images
# -n = use numbered sections
gnuhtml2latex -c -s -p -H -n "${file}_fixedup_ready_for_tex_conversion.html" > "${file}_frompdf.tex"
echo "The resulting file is in ${file}_frompdf.tex"

echo
echo "Fixing up ${file}_frompdf.tex to ${file}_frompdf_fixedup.tex"
# Turn every \par into \newline
sed -i 's/\\par/\\newline/g' "${file}_frompdf.tex"
# Drop the first six lines of the generated file and prepend header.inc instead
( cat header.inc ; tail -n +7 "${file}_frompdf.tex" ) > "${file}_frompdf_fixedup.tex"

echo
echo "Converting ${file}_frompdf_fixedup.tex to ${file}_frompdf_fixedup.pdf for inspection..."
pdflatex -interaction nonstopmode "${file}_frompdf_fixedup.tex" > tex_to_pdf_errors_and_warnings.txt

echo
echo "Opening ${file}_frompdf_fixedup.pdf with Evince - you can try another PDF viewer if you like..."
evince "${file}_frompdf_fixedup.pdf"
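
The comments in the fix-up steps describe two kinds of cleanup: getting a br out of a b, and dropping the space before a line break. A minimal sketch of what such substitutions could look like, assuming the HTML uses lowercase <b> and <br> tags; the function name fixup_html and both patterns are illustrative assumptions, not the patterns from the script itself:

```shell
#!/bin/sh
# Hypothetical sketch: the tag patterns below are assumptions for
# illustration, not the ones used by the original script.
# Reads HTML on stdin and writes the cleaned-up HTML to stdout.
fixup_html() {
    # 1st expression: move a line break that sits just inside the end of a bold run to after it
    # 2nd expression: drop any spaces directly in front of a line break
    sed -e 's,<br></b>,</b><br>,g' \
        -e 's, *<br>,<br>,g'
}
```

Used as a filter, for example fixup_html < input.html > output.html, it leaves the input file untouched, which makes it easy to compare before and after.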

The script uses one extra file, header.inc, which contains customizations:

\documentclass[a4paper,11pt,oneside]{article}
\usepackage{a4wide} % Slightly more text per page
\usepackage[dutch]{babel} % Dutch hyphenation (word splitting) and the euro symbol
\usepackage{amsmath} % Extended math capabilities
\usepackage{amssymb} % Special symbols such as the sets Z, R...
\usepackage{url} % For handling URLs
\usepackage{graphicx} % For including figures
\usepackage[small,bf,hang]{caption} % To improve the captions a bit
\usepackage{xspace} % Magic spaces after a command
\usepackage[utf8]{inputenc} % To type non-ASCII characters directly
\usepackage{float} % To create new float environments. Also provides the H option!
\usepackage{flafter} % So that floats do not jump ahead
\usepackage{listings} % For displaying literal text and code listings
\usepackage{marvosym} % For the euro symbol
\usepackage{eurosym} % For the euro symbol
\usepackage{textcomp} % For degrees Celsius, among other things
\usepackage{fancyhdr} % For fancy headers and footers.
\usepackage{graphics} % For including figures.
\usepackage[a4paper,plainpages=false]{hyperref} % For hyperlinks in the PDF document.
\usepackage[usenames,dvipsnames]{xcolor}

% General macro definitions
\newcommand{\npar}{\par \vspace{0.2ex}}

\setlength\textheight{9.75in}
\setlength\textwidth{7in}
\topmargin -0.5in
\headheight 0.0in
\oddsidemargin -.25in
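
The script puts header.inc to use by printing it first and then appending the generated .tex file from line 7 onwards. A minimal sketch of that splice as a reusable function, assuming the number of lines to skip is passed in rather than hard-coded (the name splice_preamble and its argument order are hypothetical):

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Usage: splice_preamble HEADER BODY SKIP
# Prints HEADER followed by BODY without its first SKIP lines.
splice_preamble() {
    cat "$1"
    # tail -n +K starts printing at line K, so skipping N lines needs K = N + 1
    tail -n "+$(($3 + 1))" "$2"
}
```

With a skip count of 6 this produces the same output as the cat and tail pair in the script, for example splice_preamble header.inc test.pdf_frompdf.tex 6 > test.pdf_frompdf_fixedup.tex.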

[Update] You can also try AbiWord, although in my opinion the method above yields better results:

abiword --to=tex "filename.pdf"


Sunday, 17 June 2012

Rudimentary PDF to LaTeX conversion in Linux

Saturday, 2 June 2012

When entering a room, do it eyes first rather than head...