PDF Handler (pdfh)

Examine all PDF files in lookup directories, identify them using regular expressions, rename them, and copy them to organized directories.

Installation

gem install pdfh

Dependencies

You need to install pdftotext to extract text from PDF files.

macOS

brew install xpdf

Fedora

sudo dnf install -y poppler-utils

Arch

sudo pacman -S poppler

Usage

After installing this gem, create your configuration file at one of the following locations:

~/.config/pdfh.yml
~/pdfh.yml
or configure the PDFH_CONFIG_FILE environment variable
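
The lookup above can be pictured with a minimal Ruby sketch. The helper name `find_config_file` and the precedence (environment variable first, then `~/.config`, then the home directory) are assumptions made for illustration; the source does not state the order pdfh actually uses.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: returns the first existing config path, or nil.
# Assumption: PDFH_CONFIG_FILE wins over the two default locations.
def find_config_file(env: ENV, home: Dir.home)
  candidates = [
    env["PDFH_CONFIG_FILE"],
    File.join(home, ".config", "pdfh.yml"),
    File.join(home, "pdfh.yml")
  ].compact

  candidates.find { |path| File.file?(path) }
end

# Demo against a throwaway home directory so nothing real is touched.
Dir.mktmpdir do |home|
  FileUtils.touch(File.join(home, "pdfh.yml"))
  puts find_config_file(env: {}, home: home) # prints the path ending in /pdfh.yml
end
```

Passing `env` and `home` as arguments keeps the lookup easy to try without changing your real environment.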

Then run:

pdfh

The tool will:

Scan all PDFs in the configured lookup_dirs
Extract text from each PDF using pdftotext
Match the extracted text from each PDF against your configured document_types (via re_id)
Copy matched documents to organized directories within destination_base_path
Rename files according to your name_template
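
The steps above can be sketched roughly as follows. This is not pdfh's implementation: `extract_text`, `find_type` and `process` are hypothetical names, the store path is treated as a fixed string, and renaming is left out. The only external call is `pdftotext FILE -`, which writes the extracted text to standard output.

```ruby
require "open3"
require "fileutils"

# Shells out to pdftotext; "-" sends the extracted text to stdout.
def extract_text(pdf_path)
  text, status = Open3.capture2("pdftotext", pdf_path, "-")
  status.success? ? text : ""
end

# Returns the first document type whose identifier regex matches the text.
def find_type(text, document_types)
  document_types.find { |type| text.match?(Regexp.new(type[:re_id])) }
end

# Hypothetical driver. The extractor is injectable so the flow can be
# tried without real PDFs or a pdftotext binary.
def process(lookup_dir, destination, document_types, extractor: method(:extract_text))
  Dir.glob(File.join(lookup_dir, "*.pdf")).sort.filter_map do |pdf|
    type = find_type(extractor.call(pdf), document_types)
    next unless type # unmatched PDFs are left alone

    target_dir = File.join(destination, type[:store_path])
    FileUtils.mkdir_p(target_dir)
    target = File.join(target_dir, File.basename(pdf))
    FileUtils.cp(pdf, target) # copy, so the original stays in place
    target
  end
end
```

A call such as `process("/tmp/in", "/tmp/out", types)` would return the list of copied paths, with `types` being an array of hashes holding `:re_id` and `:store_path`.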

Configuration

Example configuration:

---
lookup_dirs:                   # Directories where all PDFs will be analyzed
  - ~/Downloads
destination_base_path: ~/PDFs  # Directory where all matching documents will be copied (MUST exist)
document_types:
  - name: My Bank                         # Description (type)
    re_id: 'Account ID: 12334-\w{3}'      # [OPTIONAL (uses name as fallback)] RegEx to match from PDF content as document identifier
    re_date: '\d{1,2} de (\w+) de (\d+)'  # Date RegEx (to extract from PDF content)
    store_path: "{year}/bank_docs"        # Relative path to copy this document
    name_template: '{period} {name}'      # [OPTIONAL] Template for new filename when copied
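
To illustrate how such a file might be read, here is a small Ruby sketch that parses the YAML and builds an identifier regex for each document type. The second type ("Water Bill") is made up to show the `re_id` fallback, and treating the fallback as a literal match on `name` is an assumption, not confirmed behaviour.

```ruby
require "yaml"

yaml_text = <<~'YAML'
  ---
  lookup_dirs:
    - ~/Downloads
  destination_base_path: ~/PDFs
  document_types:
    - name: My Bank
      re_id: 'Account ID: 12334-\w{3}'
      re_date: '\d{1,2} de (\w+) de (\d+)'
      store_path: "{year}/bank_docs"
      name_template: '{period} {name}'
    - name: Water Bill                      # hypothetical type without re_id
      re_date: '(?<m>\d{2})(?<y>\d{4}) -'
      store_path: "{year}/utilities"
YAML

# Assumption: when re_id is absent, the type name is matched literally.
def id_regex(type)
  Regexp.new(type["re_id"] || Regexp.escape(type["name"]))
end

config = YAML.safe_load(yaml_text)
config["document_types"].each do |type|
  puts "#{type["name"]}: #{id_regex(type).source}"
end
```

Single-quoted YAML strings keep backslashes as written, which is why the regular expressions in the example configuration use them.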

Placeholders

The store_path and name_template settings support the following placeholders:

Placeholder	Description	Example
`{original}`	Original filename	`MyBankDocument2.pdf`
`{period}`	Year-Month	`2022-07`
`{year}`	Year	`2022`
`{month}`	Month	`07`
`{day}`	Day (if captured)	`01`
`{quarter}`	Quarter (Q1-Q4)	`Q3`
`{bimester}`	Bimester (B1-B6)	`B4`
`{name}`	Document type name	`My Bank`

The period, year, month, day, quarter and bimester placeholders are calculated from the date captured by the re_date regular expression.
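
As a worked example, the date-derived values can be computed and substituted into a template along these lines. `placeholders` and `render` are hypothetical helpers, and the sketch assumes a full date is always available, whereas pdfh only fills `{day}` when the regex captures it.

```ruby
require "date"

# Builds the placeholder values from a captured date.
def placeholders(date, name:, original:)
  {
    "original" => original,
    "period"   => date.strftime("%Y-%m"),
    "year"     => date.strftime("%Y"),
    "month"    => date.strftime("%m"),
    "day"      => date.strftime("%d"),
    "quarter"  => "Q#{(date.month - 1) / 3 + 1}", # months 1-3 => Q1, ..., 10-12 => Q4
    "bimester" => "B#{(date.month - 1) / 2 + 1}", # months 1-2 => B1, ..., 11-12 => B6
    "name"     => name
  }
end

# Replaces each {key} with its value; unknown keys are left untouched.
def render(template, values)
  template.gsub(/\{(\w+)\}/) { values.fetch(Regexp.last_match(1)) { |key| "{#{key}}" } }
end

values = placeholders(Date.new(2022, 7, 1), name: "My Bank", original: "MyBankDocument2.pdf")
puts render("{period} {name}", values)  # prints "2022-07 My Bank"
puts render("{year}/bank_docs", values) # prints "2022/bank_docs"
puts values["quarter"]                  # prints "Q3"
puts values["bimester"]                 # prints "B4"
```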

Date Extraction Examples

The re_date regex extracts date information from the PDF content:

Date text	RegEx	Captured
`01/02/2025`	`(?<d>\d{2})\/(?<m>\d{2})\/(?<y>\d{4})`	d: `01` m: `02` y: `2025`
`072025 -`	`(?<m>\d{2})(?<y>\d{4}) -`	m: `07` y: `2025`
`31 de julio de 2025`	`\d{1,2} de (\w+) de (\d+)`	month: `julio` year: `2025`

Named captures supported: y for year, m for month, d for day.

If named captures are not used, the regex groups will be matched in order: month, year.
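
The two capture styles can be tried with a short Ruby sketch. `capture_date` is a hypothetical helper that returns the raw captured strings; converting a month name such as `julio` into a number is not covered here.

```ruby
# Returns the captured date parts as a hash, or nil when nothing matches.
def capture_date(text, pattern)
  match = Regexp.new(pattern).match(text)
  return nil unless match

  if match.names.empty?
    # Unnamed groups: the first is the month, the second is the year.
    { month: match[1], year: match[2] }
  else
    named = match.named_captures
    { day: named["d"], month: named["m"], year: named["y"] }.compact
  end
end

p capture_date("01/02/2025", '(?<d>\d{2})\/(?<m>\d{2})\/(?<y>\d{4})')
p capture_date("072025 -", '(?<m>\d{2})(?<y>\d{4}) -')
p capture_date("31 de julio de 2025", '\d{1,2} de (\w+) de (\d+)')
```

The three calls correspond to the rows of the table above, so the printed values should line up with its "Captured" column.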

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run:

rake install

# or, step by step
gem build pdfh.gemspec
gem install pdfh-*.gem

To release a new version, run:

rake bump
rake release

This will create a git tag for the version, push git commits and tags, and upload the .gem file to rubygems.org.

Conventional Commits

Commit messages follow the Conventional Commits format. To lint them locally with commitlint:

npm install -g @commitlint/cli @commitlint/config-conventional
commitlint --from origin --to @

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/iax7/pdfh. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Pdfh project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

Command Options

Run with verbose output:

pdfh -v

Run in dry-run mode (no files will be copied):

pdfh --dry

Show version:

pdfh --version
