flexget and pdf files

Posted by david marsh on Thu 01 March 2012

I've recently started looking at the awesome flexget and thought it would solve a problem with my son's school newsletter.

Like most schools, the one our son attends puts out a weekly newsletter to keep parents informed of upcoming activities and events. When he started, the school gave us the option of either having a physical printout sent home with our child1, or downloading the pdf from their website.

We opted out of the printed newsletter and were happy to check the website for the pdf version.

As we're both busy and sometimes forget to check the website, we'd occasionally miss things.

I wanted to download the pdf automatically when it appeared on the website and email it to us so we wouldn't have to remember to check for the latest version.

Initially I used wget called from cron to check the website like this:

/usr/bin/wget -r -l1 -N --no-verbose --continue --no-parent \
--no-directories --no-host-directories --reject html,htm,txt \
--accept .pdf -o /var/log/newsletters.log \
--directory-prefix=/srv/samba/newsletters \
http://www.quakershie-p.schools.nsw.edu.au/newsletters

I used the -N (timestamping) and --continue flags so that wget would skip pdfs it had already downloaded rather than fetch them over and over. Even taking this into consideration, this method still felt like a brute-force approach.

(I won't go into it here, but I then use incron to watch /srv/samba/newsletters for changes; that triggers another script which emails the file as an attachment.)
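For anyone wanting to wire up the same thing, the incron side can be sketched roughly like this; the script name and the mutt invocation are illustrative, not my exact setup:

```shell
# incrontab entry (edit with `incrontab -e`); incron expands
# $@ to the watched directory and $# to the file name:
/srv/samba/newsletters IN_CLOSE_WRITE /usr/local/bin/mail-newsletter.sh $@/$#

# mail-newsletter.sh could then be as simple as:
#   #!/bin/sh
#   mutt -s "Newsletter: $(basename "$1")" -a "$1" -- rdmarsh@gmail.com < /dev/null
```

IN_CLOSE_WRITE only fires once a file has been fully written, so you don't mail out a half-downloaded pdf.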

I like how flexget remembers what it has seen in a database and won't download the same file again. I thought this would solve the problem very elegantly, and as I couldn't find much info about automatically fetching files from a URL, I thought I'd share my config here so others looking to do the same could benefit.

Here's my newsletter.yml config:

presets:
  global:
    free_space:
      path: /srv/samba/newsletters
      space: 1 # make sure there's 1gb free before downloading more
    domain_delay:
      www.quakershie-p.schools.nsw.edu.au: 10 seconds
    email:
      active: True
      from: davidmarsh
      to:
        - rdmarsh@gmail.com
feeds:
  newsletter:
    interval: 6 hours
    html:
      url: http://www.quakershie-p.schools.nsw.edu.au/newsletters/
      title_from: link
    regexp:
      accept:
        - quakers_whisper
      rest: reject
    download: /srv/samba/newsletters

This will:

  • In the global section
    1. Check there's 1gb free on /srv/samba/newsletters (which is on the same disk as /)
    2. Wait 10 seconds between checks of www.quakershie-p.schools.nsw.edu.au (even though there's only one check, I wanted this here in case I add more later)
    3. email me if it downloads something
  • In the feeds section:
    1. wait 6 hours between checks (if called sooner it will not run the check)
    2. use the url for checks
    3. name the downloaded files from their link title
    4. accept any file starting with "quakers_whisper" (the name of the newsletter)
    5. reject any other links it finds that don't match the above
    6. download the matching links to the /srv/samba/newsletters directory
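Before handing it over to cron, it's worth a dry run; flexget's --test flag goes through the whole check without downloading anything or recording entries in its database:

```shell
# dry run: shows what would be accepted or rejected, downloads nothing
/usr/local/bin/flexget --test -c /home/davidmarsh/.flexget/newsletter.yml
```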

I call it from cron with this command:

/usr/local/bin/flexget --cron -c /home/davidmarsh/.flexget/newsletter.yml

It doesn't really matter how often it runs, as it will only actually hit the website every 6 hours due to the interval: 6 hours option in the yml file.2
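For completeness, the crontab entry itself can be as simple as an hourly run (the schedule here is just an example):

```shell
# m h dom mon dow  command
0 * * * *  /usr/local/bin/flexget --cron -c /home/davidmarsh/.flexget/newsletter.yml
```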

(As before, I'm still using incron to call the scripts that email the files.)

Now we get an email with the newsletter attached within 6 hours of a new newsletter appearing on the school's website.


  1. Which may or may not make it home scrunched up in a small ball in the bottom of a school bag 

  2. Of course I could make it more frequent, but 6 hours seemed good enough. 

tags: linux