PDF Handler (pdfh)
Examine all PDF files in lookup directories, identify them using regular expressions, rename them, and copy them to organized directories.
Installation
gem install pdfh
Dependencies
You need to install pdftotext to extract text from PDF files.
macOS
brew install xpdf
Fedora
sudo dnf install -y poppler-utils
Arch
sudo pacman -S poppler
Usage
After installing this gem, create your configuration file in one of the following locations:
- ~/.config/pdfh.yml
- ~/pdfh.yml
- or set the PDFH_CONFIG_FILE environment variable
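As a rough illustration of how such a lookup could work, the sketch below collects the candidate paths and returns the first one that exists. The helper names and the precedence (environment variable first, then ~/.config, then the home directory) are assumptions made for this example, not pdfh's documented behavior.

```ruby
# Hypothetical helpers: the lookup order is an assumption of this sketch.
def config_candidates(env = ENV, home = Dir.home)
  [
    env["PDFH_CONFIG_FILE"],                  # explicit override, if set
    File.join(home, ".config", "pdfh.yml"),
    File.join(home, "pdfh.yml")
  ].compact
end

# Returns the first candidate that is an existing file, or nil.
def find_config(env = ENV, home = Dir.home)
  config_candidates(env, home).find { |path| File.file?(path) }
end

puts config_candidates({}, "/home/user")
```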
Then run:
pdfh
The tool will:
- Scan all PDFs in the configured lookup_dirs
- Extract text from each PDF using pdftotext
- Match the extracted text from each PDF against your configured document_types (via re_id)
- Copy matched documents to organized directories within destination_base_path
- Rename files according to your name_template
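The extract-and-match steps can be sketched as follows. This is a minimal illustration, not pdfh's actual code: DocumentType, extract_text and find_document_type are hypothetical names, and treating a missing re_id as a literal match on the name is an assumption about how the fallback works. The only external call is `pdftotext <file> -`, which writes the extracted text to standard output.

```ruby
require "open3"

# Hypothetical stand-in for one entry of document_types.
DocumentType = Struct.new(:name, :re_id, keyword_init: true) do
  # Assumption: without re_id, the name itself is matched literally.
  def id_regex
    Regexp.new(re_id || Regexp.escape(name))
  end
end

# Runs pdftotext and returns the extracted text, or nil on failure.
def extract_text(pdf_path)
  stdout, status = Open3.capture2("pdftotext", pdf_path, "-")
  status.success? ? stdout : nil
end

# Returns the first document type whose identifier matches the text.
def find_document_type(text, types)
  types.find { |type| type.id_regex.match?(text) }
end

types = [
  DocumentType.new(name: "My Bank", re_id: 'Account ID: 12334-\w{3}'),
  DocumentType.new(name: "Electric Bill") # no re_id: the name is used
]
match = find_document_type("Statement for Account ID: 12334-ABC", types)
puts match.name
```

A PDF whose text matches no document type yields nil here, so a caller could skip it.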
Configuration
Example configuration:
---
lookup_dirs: # Directories where all PDFs will be analyzed
  - ~/Downloads
destination_base_path: ~/PDFs # Directory where all matching documents will be copied (MUST exist)
document_types:
  - name: My Bank # Description (type)
    re_id: 'Account ID: 12334-\w{3}' # [OPTIONAL (uses name as fallback)] RegEx to match from PDF content as document identifier
    re_date: '\d{1,2} de (\w+) de (\d+)' # Date RegEx (to extract from PDF content)
    store_path: "{year}/bank_docs" # Relative path to copy this document
    name_template: '{period} {name}' # [OPTIONAL] Template for new filename when copied
Placeholders
Store Path and Name Template support the following placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {original} | Original filename | MyBankDocument2.pdf |
| {period} | Year-Month | 2022-07 |
| {year} | Year | 2022 |
| {month} | Month | 07 |
| {day} | Day (if captured) | 01 |
| {quarter} | Quarter (Q1-Q4) | Q3 |
| {bimester} | Bimester (B1-B6) | B4 |
| {name} | Document type name | My Bank |
The period, year, month, day, quarter and bimester placeholders are calculated from the date captured by the re_date regular expression.
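To make the derived placeholders concrete, here is a small sketch that computes them from a date and fills a template. The function names are hypothetical, and leaving unknown placeholders untouched is an assumption of the example. The quarter and bimester formulas are chosen so that July yields Q3 and B4, as in the table.

```ruby
require "date"

# Hypothetical helper: builds the placeholder values from a captured date.
def placeholder_values(date, name:, original:)
  {
    "original" => original,
    "period"   => format("%04d-%02d", date.year, date.month),
    "year"     => date.year.to_s,
    "month"    => format("%02d", date.month),
    "day"      => format("%02d", date.day),
    "quarter"  => "Q#{(date.month - 1) / 3 + 1}",  # months 1-3 => Q1, ...
    "bimester" => "B#{(date.month - 1) / 2 + 1}",  # months 1-2 => B1, ...
    "name"     => name
  }
end

# Replaces each {placeholder}; unknown ones are left as-is (an assumption).
def expand_template(template, values)
  template.gsub(/\{\w+\}/) { |token| values.fetch(token[1..-2], token) }
end

values = placeholder_values(Date.new(2022, 7, 1),
                            name: "My Bank", original: "MyBankDocument2.pdf")
puts expand_template("{period} {name}", values)   # prints "2022-07 My Bank"
puts expand_template("{year}/bank_docs", values)  # prints "2022/bank_docs"
```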
Date Extraction Examples
The re_date regex extracts date information from the PDF content:
| Date text | RegEx | Captured |
|---|---|---|
| 01/02/2025 | (?<d>\d{2})\/(?<m>\d{2})\/(?<y>\d{4}) | d: 01, m: 02, y: 2025 |
| 072025 - | (?<m>\d{2})(?<y>\d{4}) - | m: 07, y: 2025 |
| 31 de julio de 2025 | \d{1,2} de (\w+) de (\d+) | month: julio, year: 2025 |
Named captures supported: y for year, m for month, d for day.
If named captures are not used, the regex groups will be matched in order: month, year.
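The capture rules above can be sketched in a few lines of Ruby. This is an illustration only: extract_date_parts, normalize_month and the Spanish month list are hypothetical, and how pdfh actually converts month names to numbers is not stated here.

```ruby
# Hypothetical month lookup; pdfh's own month handling may differ.
MONTHS_ES = %w[enero febrero marzo abril mayo junio julio agosto
               septiembre octubre noviembre diciembre].freeze

# Converts "7", "07" or "julio" to "07"; returns nil if unrecognized.
def normalize_month(value)
  return nil if value.nil?
  return format("%02d", value.to_i) if value.match?(/\A\d+\z/)

  index = MONTHS_ES.index(value.downcase)
  index ? format("%02d", index + 1) : nil
end

def extract_date_parts(text, re_date)
  match = Regexp.new(re_date).match(text)
  return nil unless match

  if match.names.empty?
    # No named captures: groups are read in order, month then year
    month, year = match.captures
    day = nil
  else
    year, month, day = match.named_captures.values_at("y", "m", "d")
  end
  { year: year, month: normalize_month(month), day: day }
end

p extract_date_parts("31 de julio de 2025", '\d{1,2} de (\w+) de (\d+)')
```

With the first table row, the same function returns year "2025", month "02" and day "01"; with the second, the day is nil because no d group is captured.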
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run:
rake install
# or, step by step
gem build pdfh.gemspec
gem install pdfh-*
To release a new version, run:
rake bump
rake release
This will create a git tag for the version, push git commits and tags, and upload the .gem file to rubygems.org.
Conventional Commits
npm install -g @commitlint/cli @commitlint/config-conventional
commitlint --from origin --to @
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/iax7/pdfh. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.
Code of Conduct
Everyone interacting in the Pdfh project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.
Command Options
Run with verbose output:
pdfh -v
Run in dry-run mode (no files will be moved):
pdfh --dry
Show version:
pdfh --version