Hathi Download Helper

An assistant to download books from Hathitrust.org.



Quickstart

  1. Copy the book URL or book ID into the URL field of Hathi Download Helper and press the 'get Book info' button.
  2. Select the source file format (pdf or images), set the destination folder and press the 'start download' button.
  3. Convert and merge the downloaded files into a pdf book by pressing the 'create pdf' button.


Download resources

Source code and installer are available on:

WWW:  http://qt-apps.org/content/show.php/Hathi+Download+Helper?content=158702
WWW:  http://www.softpedia.com/get/Internet/Download-Managers/Hathi-Download-Helper.shtml

Comments, feedback, bug reports and questions are welcome:
hathidownloadhelper@hotmail.com


User interface elements

The main window of Hathi Download Helper is separated into three group boxes. Each group box corresponds to a certain processing step:
  • step 1: Identifying target book ID → Book information group box
  • step 2: Downloading book pages → Download settings group box
  • step 3: Converting downloaded data → PDF merge & conversion group box
Furthermore the menu bar provides additional features, namely:
  • Page setup
  • GUI setup (font setup, style setup)
  • User settings
  • Proxy settings
  • Batch job
  • Export links
  • About
  • Help
  • Update check
  • Auto update check

Menu Bar

The menu bar provides the following options:
  • File → Exit : Exits Hathi Download Helper.

  • Options → Page Setup : Opens the dialog for setting the page size (letter, A3, A4, etc.) and the page margins (top, right, left, bottom).

  • Options → GUI setup → Style setup : Opens the dialog for GUI style setup. Available styles depend on your operating system.

  • Options → GUI setup → Font setup : Opens the dialog for GUI font setup. You can adjust font type, size, etc. The size of the GUI is adjusted to the space required for all elements.

  • Options → GUI setup → Default : Resets font and GUI style to default settings.

  • Options → User settings : Opens the dialog for default application settings.

  • Tools → Proxy : Opens the proxy server setup dialog. Here you can set the proxy IP, port number and proxy type. A user name and password may be entered for authentication; otherwise leave those fields empty.

  • Tools → Create batch job : Opens the batch job dialog. With this dialog you can create a list of books to be downloaded one after the other.

  • Tools → Export links : Opens the export dialog, which creates an html file containing links to each book page. You can use this feature to download all book pages yourself with your favorite download tool.

  • Help → Help : Opens the help dialog you are currently reading.

  • Help → About : Opens the Hathi Download Helper about dialog.

  • Help → About Qt : Opens the About Qt dialog.

  • Help → Check for Update : Checks whether a new version is available online and provides download links for source code and installer.

  • Help → Automatically check for Update : Enables or disables the automatic update check on start-up.


Group Boxes


Book information

Book URL:

use proxy server

The 'Book information' group box holds the URL input field as well as the received book information: title, number of pages, book ID, publisher and author.
After you enter the book URL and press the 'get Book info' button, Hathi Download Helper reads the html document of the book.
Alternatively, the book ID can be entered instead of the URL. If desired, a proxy server can be used by selecting the corresponding checkbox. When the book is blocked, e.g. due to copyright restrictions, the message "Received empty document..." is displayed close to the progress bar.


Download settings

pdfs

images

image zoom

download OCR text

pages

create pdf book after download

resume book download

enable WebProxies

In the 'Download settings' group box the user can choose between two file formats:
  • pdfs : select this option to download the book pages as single searchable pdf documents generated by Hathitrust.org. After the download you have the option to merge all pdf files. For this operation Hathi Download Helper uses 'pdftk' (see: PDF merge & conversion). Note: The download of pdf files is limited to approximately 15 files per 5 minutes.
  • images : select this option to download the book pages as image files (jpeg, png). The image quality depends on the selected resolution. The number of files you can download without a waiting period is much higher than for pdf download.
  • The image quality can be adjusted by selecting a zoom factor. The listed dpi values are approximations and depend on the selected page size.
  • To generate 'searchable' pdfs, Hathi Download Helper has the option to download ocr text files in addition to the image files. The ocr text files are stored as html documents on your hard disk.
  • Using the 'pages' input field the user can decide whether to download a whole book or only certain pages. Single page numbers have to be separated by commas (e.g. 1,3,5). Page ranges have to be indicated by a hyphen, starting with the smaller value (e.g. 5-10, 20-30).
  • Selecting the 'create pdf book after download' check box automatically starts the pdf merge and conversion process after the download to generate pdf files or a single pdf book file.
  • When the 'resume book download' option is checked, Hathi Download Helper checks whether files of the specified book were already downloaded during a previous session. By default this option is not checked and the downloader re-downloads all files.
  • The 'enable WebProxies' check box activates the built-in webproxy support of Hathi Download Helper. This feature automatically sends download requests via several webproxies to bypass the download limitations (e.g. for the pdf files) of hathitrust.org. Note: Only non-restricted books, which are accessible for non-US citizens, can be downloaded. See WebProxy feature for details.
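
The syntax of the 'pages' input field described above can be illustrated with a short sketch. This is not the code used by Hathi Download Helper; the function name parse_pages and the handling of duplicates and empty entries are assumptions made for illustration only.

```python
def parse_pages(spec):
    """Expand a page selection such as '1,3,5' or '5-10, 20-30'.

    Returns a sorted list of unique page numbers (an illustrative choice).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                # ranges have to start with the smaller value
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For example, parse_pages("1,3,5") yields the three single pages, while parse_pages("5-10, 20-30") expands both ranges including their end values.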



PDF merge & conversion

merge pdfs

convert & merge images to pdf book

convert images to single pdf files

use plaintext (ocr text) only

set pdf resolution

In the 'PDF merge & conversion' group box the user can choose between the following options:
  • 'merge pdfs': Merge single pdf files using the free tool 'pdftk' (http://www.pdflabs.com).
  • 'convert & merge images to pdf book': Convert and merge images to a pdf book. Page size and page margins are editable via 'Options' → 'Page setup'.
  • 'convert images to single pdf files': Create a single pdf file for each page.
  • 'set pdf resolution': Sets the output resolution for pdf files generated by Hathi Download Helper from images/ocr text files.



Features

This section holds some information about the file naming and folder structure used by Hathi Download Helper. Furthermore, you will find some explanations about Hathi Download Helper as PDF merger and Image-to-PDF converter.



WebProxy feature

Hathi Download Helper provides an option that uses a large number of random webproxies to download data from hathitrust.org.
This feature redirects all download requests to free web proxy services so that the download can continue while the server's download limitation for the user is active. Please note the following information:

Restrictions:
• Works only for non-restricted books which are accessible for non-US citizens.
• Strongly varying download speed.

Important advice:
• Since this feature uses a large number of random web pages, an up-to-date virus scanner is recommended.
• There is no guarantee for proper functioning.





Network Proxy

Hathi Download Helper provides an option to enable a network proxy. The implementation uses the QNetworkProxy class of Qt 4.7.4.
The following types are supported:

Proxy type: SOCKS_5
Description: Generic proxy for any kind of connection. Supports TCP, UDP, binding to a port (incoming connections) and authentication.
Default capabilities: TunnelingCapability, ListeningCapability, UdpTunnelingCapability, HostNameLookupCapability





File and folder structure


Hathi Download Helper creates the following sub-folder structure for downloaded data inside the target directory:
  • 'pdfs': Folder for downloaded pdf files
  • 'images': Folder for downloaded image files
  • 'ocr': Folder for downloaded ocr text files (*.html)
Note: All downloaded data (images, pdfs, ocr files) will be kept. If you don't need them any more, you have to delete them manually.

Note: When restarting a download (with the same book ID and the same destination folder), all files downloaded in the previous session will be overwritten unless you have selected the 'resume book download' option. In that case the downloader checks whether a corresponding file (with the same name) already exists and does not download this file again.
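
The resume behaviour described in this note can be sketched as follows. This is only an illustration under stated assumptions, not the application's actual code: the function name files_still_needed and its parameters are hypothetical, and only the existence of a file with the same name is checked.

```python
from pathlib import Path


def files_still_needed(folder, file_names, resume=False):
    """Return the file names that have to be downloaded into `folder`.

    Without resume every file is downloaded again (and overwritten).
    With resume, files that already exist under the same name are skipped.
    """
    if not resume:
        return list(file_names)
    target = Path(folder)
    return [name for name in file_names if not (target / name).is_file()]
```

A check by name only cannot detect a partially written file from an interrupted session, so a real downloader may need an additional size or integrity check.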

Hathi Download Helper creates the following sub-folder structure for converted data inside the source directory:
  • 'pdfs': Folder for generated pdf files. Existing files will be overwritten.
  • 'pdfs_text_only': Folder for generated pdf files with ocr text only.
Note: Since the target folder for download is the source folder for conversion, all existing pdf files within the 'pdfs' folder will be overwritten when 'single pdf' conversion is selected as the output option!





Namespace

Hathi Download Helper uses a fixed naming scheme for downloaded data, starting with the document ID (with reserved characters removed).
This naming scheme is used for pdf files, images and ocr text files (html files).

File format: ID + "_page_" + page number + filetype extension

Examples for the document ID hvd.32044038439063:

PDF example: hvd.32044038439063_page_001.pdf
JPG example: hvd.32044038439063_page_001.jpg
OCR example: hvd.32044038439063_page_001.html
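
The naming scheme can be sketched in a few lines. This is an illustration, not the application's code: the function name page_file_name is hypothetical, the set of reserved characters is an assumption (the characters that are not allowed in Windows file names), and the three-digit zero padding is inferred from the examples above.

```python
# Assumption: "reserved characters" are those not allowed in Windows file names.
RESERVED_CHARS = '<>:"/\\|?*'


def page_file_name(book_id, page, extension):
    """Build ID + "_page_" + page number + filetype extension."""
    safe_id = "".join(ch for ch in book_id if ch not in RESERVED_CHARS)
    # Zero padding to three digits is inferred from the documented examples.
    return f"{safe_id}_page_{page:03d}.{extension}"
```

With these assumptions, page_file_name("hvd.32044038439063", 1, "pdf") reproduces the PDF example listed above.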





Hathi Download Helper as PDF merger

Hathi Download Helper is able to merge any pdf files using the 'pdftk' (pdf toolkit) application. For this purpose the radio button "merge pdfs" has to be selected. When you select a folder that does not contain content downloaded by Hathi Download Helper (files/folders), a file dialog for file selection will appear. If you are running a Linux or Mac OS system, you have to install the 'pdftk' tool (http://www.pdflabs.com). For Windows systems Hathi Download Helper ships with a copy of 'pdftk'.
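
For readers who want to reproduce the merge step by hand, the sketch below builds a pdftk command line using the tool's 'cat' operation. How Hathi Download Helper invokes pdftk internally is not documented here; the function name and the alphabetical page ordering are assumptions for illustration.

```python
def build_pdftk_merge_command(input_files, output_file):
    """Return the argument list for: pdftk <inputs...> cat output <output_file>."""
    if not input_files:
        raise ValueError("no pdf files to merge")
    # Sorting by name gives page order when page numbers are zero-padded
    # (e.g. ..._page_001.pdf, ..._page_002.pdf).
    return ["pdftk", *sorted(input_files), "cat", "output", output_file]
```

The returned list can be passed to subprocess.run(command, check=True), provided that pdftk is installed and found on the search path.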





Hathi Download Helper as Image-to-PDF converter

Hathi Download Helper is able to convert a number of different image formats into pdf files. For this purpose the radio button "convert & merge images to pdf book" or "convert images to single pdf files" has to be selected. When you select a folder that does not contain content downloaded by Hathi Download Helper (files/folders), a file dialog for file selection will appear.

Note: Since the target folder for download is the source folder for conversion, all existing pdf files within the 'pdfs' folder will be overwritten when 'single pdf' conversion is selected as the output option!





FAQ


  • What does the name "Hathi Download Helper" mean?
  • Hathi (pronounced hah-tee) is the Hindi word for elephant, an animal highly regarded for its capability to suck a huge amount of water into its trunk, and to blow the water into its mouth. In computer networks, to download means to receive data to a local system from a remote system, or to initiate such a data transfer. Helper refers to a device that helps. In combination, the words convey the key benefits users can expect from this application - to download pages or complete books in an easy way.


  • Server: maximum download limit exceeded...please wait....
  • Hathitrust.org limits the download of all files. When you download too many files in a short period of time, you are forced to wait for some time. For pdf files the limit is about 15 files per 5 minutes; afterwards you have to wait for approximately 5 minutes. You may activate the WebProxy feature to download data via several webproxies during this waiting period.
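
    A client can avoid running into this limit by pacing its own requests. The following sketch shows a simple sliding-window throttle. It is not part of Hathi Download Helper; the class name is hypothetical, and the default values (15 files per 300 seconds) are taken from the approximate figures above, so the real server behaviour may differ.

```python
from collections import deque


class DownloadThrottle:
    """Allow at most `max_files` downloads within `window_seconds`."""

    def __init__(self, max_files=15, window_seconds=300.0):
        self.max_files = max_files
        self.window_seconds = window_seconds
        self._timestamps = deque()  # start times of recent downloads

    def wait_time(self, now):
        """Seconds to wait at time `now` before the next download may start."""
        # Forget downloads that have left the time window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_files:
            return 0.0
        return self.window_seconds - (now - self._timestamps[0])

    def record(self, now):
        """Register a download that starts at time `now`."""
        self._timestamps.append(now)
```

    A download loop would call wait_time(time.monotonic()), sleep for the returned number of seconds, and then call record() before requesting the next file.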


  • Suddenly Hathitrust.org is not reachable anymore...
  • This behaviour may occur due to excessive download requests. In this case the user's IP address might be blocked by Hathitrust.org for approximately 5 minutes.


  • Why are the created PDF files so huge?
  • Hathi Download Helper uses a PDF printer (Qt::QPrinter), which 'prints' the images into the pdf file. Since QPrinter only supports the jpg image format, all pages are stored as jpg images inside the pdf file. Therefore even text-only pages have to be stored in the same way as full-resolution images.


  • How does Hathi Download Helper generate searchable PDF files? Is there any OCR software involved?
  • Hathi Download Helper does not have any OCR functionality. Instead it uses the OCR files generated by Hathitrust.org. The downloaded OCR files are stored as html files on your hard disk. For PDF creation the OCR text is printed on each page and overlaid by the corresponding image.






    ERROR FIXING

  • "Error: unable to execute 'pdftk' application."
  • Hathi Download Helper uses the 'pdftk' application to merge existing pdf files. The error may occur due to missing permissions for the pdftk files. To fix this error, perform the following steps depending on your OS:
  • Windows
    1. Download and install 'pdftk' from http://www.pdflabs.com
    2. Open the pdftk program folder and copy the files pdftk.exe and libiconv2.dll
    3. Open the Hathi Download Helper folder containing the hathidownloadhelper.exe file and create a new folder named pdftk
    4. Copy the files from step 2 into the pdftk subfolder.
    Hint: If you have compiled Hathi Download Helper on your own, you have to place the pdftk subfolder in your Debug/Release target folder containing the HathiDownloadHelper.exe file.
  • Linux/Mac
    1. Download and install 'pdftk' from http://www.pdflabs.com or use the pdftk file placed in the pdftk subfolder attached to this project.

      E.g. when you are using Ubuntu you can install pdftk by the following command:
      sudo apt-get install pdftk





    Changelog

    2013.05.18:initial version 1.0.0
    2013.05.19:version 1.0.1 released:
    fixed bug in image resolution setting after 'page setup' dialog, renamed images files in qt resources, copied image files in application directory
    2013.05.24:version 1.0.2 released:
    changed development environment to 4.7.4, added compiler switch for qt 5.x, tested on linux and windows system, added options for GUI style and fonts, updated GUI, bug fix for missing ocr files, reduced freezing effect of GUI during pdf creation, added 'pdftk' binary for linux/OS, added selection for proxy type.
    2013.06.03:version 1.0.3 released:
    bug fix for proxy type selection. moved pdf merge & conversion into QThread worker to eliminate freezing effect of GUI during processing. Changed usage from QPixmap to QImage for pdf creation. Changed OCR text extraction method to reduce memory usage(QWebkit is really greedy). Improved text font size adjustment method. Added Author and Publisher information. Changed Windows installer creation from QT framework installer to inno setup compiler to fix kernel32.dll error on win XP.
    2013.07.02:Version 1.0.4 released:
    improved download performance by using parallel download requests (it is really much faster now :-D ), added encryption for proxy password, added 'check for update' feature, added batch job feature for downloading several books at once, added link export function
    2013.08.18:version 1.0.5 released:
    re-implementation of all GUI elements and dialogs, fixed text clipping of GUI elements, fixed page shrinking on pdf creation due to long ocr text, improved download speed, re-designed help file
    2013.10.27:version 1.0.6 released:
    bug fixes: lost destination path for single pdf-file creation, application crash on manual file selection. Added new features for batch job dialog: 'edit book', 'load job', 'save job', added gimmicks for Halloween and Christmas, minor changes.
    2014.03.30:version 1.0.7 released:
    added new download options: webproxies, resume of book downloads, added user settings dialog, added auto-update option, coding: separated GUI from file downloader.
    2014.05.06:version 1.0.8 released:
    adjustments due to changes in hathitrust.org link structure.