I finally got around to trying a rudimentary PDF to LaTeX conversion on Linux.
"It's like turning a hamburger into a cow" :-)
Usage:
./pdftolatex.sh "filename.pdf"
Example output:
# pdftolatex.sh test.pdf
PDF to LaTeX conversion script
Copyleft 2012 (c) Tom Van Braeckel <tomvanbraeckel@gmail.com>
WARNING: this is a rudimentary first stab. Proceed with caution.
Checking if all dependencies are found...
/usr/bin/pdftohtml
Dependency pdftohtml found.
/usr/bin/gnuhtml2latex
Dependency gnuhtml2latex found.
/usr/bin/pdflatex
Dependency pdflatex found.
Converting test.pdf to test.pdfs.html
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Fixing up test.pdfs.html to test.pdf_fixedup.html...
Readying test.pdf_fixedup.html for tex conversion in test.pdf_fixedup_ready_for_tex_conversion.html
Converting test.pdf_fixedup_ready_for_tex_conversion.html to test.pdf_frompdf.tex
The resulting file is in test.pdf_frompdf.tex
Fixing up test.pdf_frompdf.tex to test.pdf_frompdf_fixedup.tex
Converting test.pdf_frompdf_fixedup.tex to test.pdf_frompdf_fixedup.pdf for inspection...
Opening test.pdf_frompdf_fixedup.pdf with Evince - you can try another PDF viewer if you like...
Script source code:
#!/bin/sh
file="$1"
dependencies="pdftohtml gnuhtml2latex pdflatex"
echo "PDF to LaTeX conversion script"
echo "Copyleft 2012 (c) Tom Van Braeckel <tomvanbraeckel@gmail.com>"
echo
echo "WARNING: this is a rudimentary first stab. Proceed with caution."
echo
if [ -z "$file" ]; then
echo "Usage: $0 <pdf file>"
echo "The resulting .tex file will be stored next to the input as <pdf file>_frompdf_fixedup.tex."
exit 1
fi
echo
echo "Checking if all dependencies are found..."
for dependency in $dependencies; do
which $dependency
if [ $? -ne 0 ]; then
echo "Dependency $dependency not found, install it first."
echo "On Debian/Ubuntu the packages are: poppler-utils (pdftohtml), gnuhtml2latex, texlive-latex-base (pdflatex)."
exit 1
else
echo "Dependency $dependency found."
fi
done
echo
echo "Converting $file to ${file}s.html"
pdftohtml -nomerge "$file" "$file".html
echo
echo "Fixing up ${file}s.html to ${file}_fixedup.html..."
# This nasty br in a b causes problems later on
sed "s,<br/></b>,</b><br/>,g" "${file}s.html" > "${file}_fixedup.html"
# ending with bold text means it is the end of a title and can be on a new line
sed -i "s,</b><br/>\$,</b><br/><br/>,g" "${file}_fixedup.html"
# starting with bold text means it is the start of a title so can be on a new line
sed -i "s,^<b>,<br/><b>,g" "${file}_fixedup.html"
# spaces: turn &nbsp; entities (presumably what was meant here) into plain spaces
sed -i "s,&nbsp;, ,g" "${file}_fixedup.html"
# there is no use in a space before a newline, and it causes a bogus indent when converting to .tex later on
sed -i "s, <br/>,<br/>,g" "${file}_fixedup.html"
# encoding, although gnuhtml2latex ignores this
sed -i "s,<HEAD>,<HEAD><meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />," "${file}_fixedup.html"
echo
echo "Readying ${file}_fixedup.html for tex conversion in ${file}_fixedup_ready_for_tex_conversion.html"
cp "${file}_fixedup.html" "${file}_fixedup_ready_for_tex_conversion.html"
# gnuhtml2latex ignores <br/>, so replace it with an empty paragraph to force a line break
sed -i "s,<br/>,<p></p>,g" "${file}_fixedup_ready_for_tex_conversion.html"
# Remove bogus links - this fixes the empty } problem
sed -i "s/<A name=[0-9]\+><\/a>//g" "${file}_fixedup_ready_for_tex_conversion.html"
echo
echo "Converting ${file}_fixedup_ready_for_tex_conversion.html to ${file}_frompdf.tex"
# -c = table of contents
# -s = write to standard out
# -p Break page after title / table of contents
# -H use hyperref package to process anchors
# -g images
# -n Use numbered sections
gnuhtml2latex -c -s -p -H -n "${file}_fixedup_ready_for_tex_conversion.html" > "$file"_frompdf.tex
echo "The resulting file is in ${file}_frompdf.tex"
echo
echo "Fixing up ${file}_frompdf.tex to ${file}_frompdf_fixedup.tex"
sed -i 's/\\par/\\newline/g' "${file}_frompdf.tex"
( cat header.inc ; tail -n +7 "${file}_frompdf.tex" ) > "${file}_frompdf_fixedup.tex"
echo
echo "Converting ${file}_frompdf_fixedup.tex to ${file}_frompdf_fixedup.pdf for inspection..."
pdflatex -interaction nonstopmode "${file}_frompdf_fixedup.tex" > tex_to_pdf_errors_and_warnings.txt
echo
echo "Opening ${file}_frompdf_fixedup.pdf with Evince - you can try another PDF viewer if you like..."
evince "$file"_frompdf_fixedup.pdf
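To see what the HTML fix-up stage does, the line-based substitutions can be tried on their own. This is only a sketch: fixup_html is a hypothetical helper name, the sample lines are made up in the style of pdftohtml output, and the four sed expressions are copied from the script (chained with -e instead of separate sed -i runs, which gives the same result because each one works within a single line).

```shell
# Hypothetical helper: apply the script's line-based HTML fix-ups to standard input.
fixup_html() {
    sed -e 's,<br/></b>,</b><br/>,g' \
        -e 's,</b><br/>$,</b><br/><br/>,g' \
        -e 's,^<b>,<br/><b>,g' \
        -e 's, <br/>,<br/>,g'
}

# Made-up sample input: a bold title line and a text line with a trailing space.
printf '%s\n' '<b>1 Introduction<br/></b>' 'Some text <br/>' | fixup_html
# prints:
# <br/><b>1 Introduction</b><br/><br/>
# Some text<br/>
```

The title line ends up with a break before and after it, and the stray space before the line break is gone.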
The script uses one extra file, header.inc, which contains customizations:
\documentclass[a4paper,11pt,oneside]{article}
\usepackage{a4wide} % Slightly more text per page
\usepackage[dutch]{babel} % For Dutch hyphenation (word splitting) and the euro symbol
\usepackage{amsmath} % Extended math capabilities
\usepackage{amssymb} % For special symbols such as the sets Z, R...
\usepackage{url} % For typesetting URLs
\usepackage{graphicx} % For including figures
\usepackage[small,bf,hang]{caption} % To improve the captions a bit
\usepackage{xspace} % Magic spaces after a command
\usepackage[utf8]{inputenc} % To type non-ASCII characters directly
\usepackage{float} % To create new float environments. Also provides the H option!
\usepackage{flafter} % So that floats do not jump ahead of the text that refers to them
\usepackage{listings} % For displaying verbatim text and code listings
\usepackage{marvosym} % For the euro symbol
\usepackage{eurosym} % For the euro symbol
\usepackage{textcomp} % For degrees Celsius, among other things
\usepackage{fancyhdr} % For fancy headers and footers.
\usepackage{graphics} % For including figures.
\usepackage[a4paper,plainpages=false]{hyperref} % For hyperlinks in the PDF document.
\usepackage[usenames,dvipsnames]{xcolor}
% General macro definitions
\newcommand{\npar}{\par \vspace{0.2ex }}
\setlength\textheight{9.75in}
\setlength\textwidth{7in}
\topmargin -0.5in
\headheight 0.0in
\oddsidemargin -.25in
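The script splices this header in with cat and tail -n +7: tail -n +7 prints from line 7 onward, so the first six lines of the generated .tex file are dropped and header.inc takes their place. The sketch below shows that mechanism on throwaway files. splice_header is a hypothetical helper name and all file contents are made up; whether six lines is the right amount to drop depends on what your version of gnuhtml2latex emits, so check the generated file if the result does not compile.

```shell
# Hypothetical helper: print a header file, then a body file minus its first six lines.
splice_header() {
    cat "$1"
    tail -n +7 "$2"
}

# Throwaway demo files; names and contents are made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' '\documentclass{article}' '\usepackage{url}' > "$tmp/header.inc"
printf '%s\n' old1 old2 old3 old4 old5 old6 '\begin{document}' 'Hello' '\end{document}' > "$tmp/in.tex"

splice_header "$tmp/header.inc" "$tmp/in.tex"
# prints:
# \documentclass{article}
# \usepackage{url}
# \begin{document}
# Hello
# \end{document}

rm -r "$tmp"
```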
[Update] You can also try the following method, using abiword, but the method above yields better results, in my opinion:
abiword --to=tex "filename.pdf"
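Since the two methods need different tools, a quick check can tell which one is usable on a given machine. A minimal sketch, assuming nothing beyond a POSIX shell: have_all is a hypothetical helper name, and it uses command -v, the POSIX-specified way to look up a command, instead of which.

```shell
# Hypothetical helper: succeed only if every named command is found on the PATH.
have_all() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || return 1
    done
    return 0
}

if have_all pdftohtml gnuhtml2latex pdflatex; then
    echo "The pdftolatex.sh method is available."
elif have_all abiword; then
    echo "Only the abiword method is available."
else
    echo "Neither method is available yet."
fi
```

Which message appears depends on what is installed, so no particular output is shown here.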