Script to download documents from Issuu on GNU/Linux

September 26, 2014

I needed to download a document from Issuu. Here is a simple script I wrote that downloads the pages, converts them to PDF, and merges them. It has no error checking, but it may be useful to other people:

#!/bin/bash

if [ $# -lt 1 ]; then
    echo "Usage: $0 <issuu_document_url>"
    exit 1
fi

tmp=$(mktemp -d)

echo "Downloading HTML page..."
wget -q "$1" -O "$tmp/html"

# Scrape the page count, the first page's image URL and the title from the HTML.
pageCount=$(grep -o '"pageCount":[0-9]*' "$tmp/html" | sed 's/.*://')
model=$(grep 'image_src' "$tmp/html" | sed 's/.*href="//; s/".*//')
title=$(grep '<title>' "$tmp/html" | sed 's/.*<title>//; s/<\/title>.*//')

echo "-> Found a document with $pageCount pages"
echo "-> First page: $model"

for i in $(seq 1 "$pageCount"); do
    download=$(echo "$model" | sed "s/page_1/page_$i/")
    echo "Downloading page ${i}..."
    wget -q "$download" -O "$tmp/page_${i}.jpg"
done

echo "Converting pages JPG -> PDF..."
# Collect the PDFs in numeric order: a glob would sort page_10 before page_2.
pages=()
for i in $(seq 1 "$pageCount"); do
    convert "$tmp/page_${i}.jpg" "$tmp/page_${i}.pdf"
    pages+=("$tmp/page_${i}.pdf")
done

echo "Merging pages..."
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="${title}.pdf" "${pages[@]}"
rm -rf "$tmp"

echo "-> Done: '${title}.pdf'"
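
The way the script derives each page's image URL can be tried on its own. This is only a minimal sketch, assuming the first-page URL contains the component page_1, as the script expects. The function name page_url and the example.com address are hypothetical, used for illustration only:

```shell
#!/bin/sh
# page_url TEMPLATE N: print TEMPLATE with its first "page_1" replaced by "page_N".
page_url() {
    printf '%s\n' "$1" | sed "s/page_1/page_$2/"
}

# Hypothetical first-page URL, for illustration only.
first='https://example.com/docs/abc123/jpg/page_1.jpg'
for n in 1 2 10; do
    page_url "$first" "$n"
done
```

This prints the same address three times, ending in page_1.jpg, page_2.jpg and page_10.jpg respectively.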

The script requires Bash, wget, Ghostscript, and ImageMagick. Most Linux distributions already ship these applications, but just in case, check that you have ImageMagick installed.
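
One possible way to do that check from a terminal is sketched below. It assumes the tools are installed under the command names the script calls (wget, gs for Ghostscript, convert for ImageMagick); the helper name missing_tools is hypothetical:

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools wget gs convert)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are printed on a single line.
    echo "Missing tools:" $missing
else
    echo "All required tools are installed."
fi
```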

Script download: issuu_download.sh (932 bytes)

To install it, just download the file, make it executable, and move it to a directory in your $PATH:

$ wget https://tiagomadeira.com/wp-content/uploads/2014/09/issuu_download.sh
$ chmod +x issuu_download.sh
$ sudo mv issuu_download.sh /usr/local/bin

To use it, just type:

$ issuu_download.sh <document_url>

Dump email addresses from files

May 3, 2012

Suppose you have a lot of .doc, .docx, .xls, .xlsx, .gz, .bz2, .zip, .pdf and plain-text (.csv, .txt, etc.) files and want to dump all the unique email addresses from them. How would you do it? Here is a simple solution I’ve just implemented (I probably didn’t test it enough, so tell me if you find any bugs):

#!/bin/sh
tmp=$(mktemp)
while [ $# -gt 0 ]; do
    if [ -r "$1" ]; then
        # Lowercase whatever follows the last dot, so "Report.2012.DOC" yields "doc".
        ext=$(echo "${1##*.}" | tr 'A-Z' 'a-z')
        case $ext in
            docx | xlsx)
                # requires: http://blog.kiddaland.net/2009/07/antiword-for-office-2007/
                cat_open_xml "$1" >> "$tmp"
                ;;
            doc)
                # requires: antiword
                antiword "$1" >> "$tmp"
                ;;
            xls)
                # requires: catdoc
                xls2csv "$1" >> "$tmp"
                ;;
            gz)
                gunzip -c "$1" >> "$tmp"
                ;;
            bz2)
                bunzip2 -c "$1" >> "$tmp"
                ;;
            zip)
                unzip -p "$1" >> "$tmp"
                ;;
            pdf)
                # requires: xpdf-utils ("-" sends the extracted text to stdout)
                pdftotext "$1" - >> "$tmp"
                ;;
            *)
                # Keep only the major MIME type, e.g. "text" from "text/plain".
                text=$(file -b --mime-type "$1" | sed -e 's|/.*||')
                if [ "$text" = "text" ]; then
                    cat "$1" >> "$tmp"
                fi
                ;;
        esac
    fi
    shift
done
# Extract the addresses, lowercase them and drop duplicates.
grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b' "$tmp" |
    tr 'A-Z' 'a-z' | sort -u
rm "$tmp"

(the email regexp is explained here: regular-expressions.info/email.html)
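
To see what the final pipeline does, it can be run on a couple of sample lines. This is just a sketch: it assumes GNU grep (for \b and -o), the dot before the top-level domain is escaped, and the function name extract_emails and the sample addresses are made up for illustration:

```shell
#!/bin/sh
# extract_emails: read text on stdin, print unique lowercase addresses, one per line.
extract_emails() {
    grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b' |
        tr 'A-Z' 'a-z' | sort -u
}

printf '%s\n' 'Contact: Alice@Example.com, bob@example.org.' \
              'Again: alice@example.com (duplicate)' | extract_emails
```

This prints alice@example.com and bob@example.org, one per line: the two spellings of Alice's address collapse into one after lowercasing, and the trailing punctuation is left out of the matches.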
