Tiago Madeira

Códigos aleatórios

Dump email addresses from files

Suppose you have a lot of .doc, .docx, .xls, .xlsx, .gz, .bz2, .pdf and text in general (.csv, .txt etc.) files and want to dump all the (unique) email addresses from them. How would you do it? Here is a simple solution I’ve just implemented (and probably didn’t test enough, so tell me if you find any bug):

#!/bin/sh
tmp=$(tempfile)
while [ $# -gt 0 ]; do
    if [ -r "$1" ]; then
        ext=$(echo ${1#*.} | tr [A-Z] [a-z])
        case $ext in
            docx | xlsx)
                # requires: http://blog.kiddaland.net/2009/07/antiword-for-office-2007/
                cat_open_xml "$1" >> $tmp
                ;;
            doc)
                # requires: antiword
                antiword "$1" >> $tmp
                ;;
            xls)
                # requires: catdoc
                xls2csv "$1" >> $tmp
                ;;
            gz)
                cat "$1" | gunzip >> $tmp
                ;;
            bz2)
                cat "$1" | bunzip2 >> $tmp
                ;;
            zip)
                unzip -p "$1" >> $tmp
                ;;
            pdf)
                # requires: xpdf-utils
                t=$(tempfile)
                pdftotext "$1" $t
                cat $t >> $tmp
                rm $t
                ;;
            *)
                text=$(file -b --mime-type "$1" | sed -e 's//.*//')
                if [ "z$text" = "ztext" ]; then
                    cat "$1" >> $tmp
                fi
                ;;
        esac
    fi
    shift
done
cat $tmp | grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}\b'
         | tr [A-Z] [a-z] | sort -u
rm $tmp

(the email regexp is explained here: regular-expressions.info/email.html)

Comentários

Nenhum comentário ainda.

Obrigado! Seu comentário foi enviado e será publicado quando for aprovado.

Infelizmente ocorreu um erro ao enviar seu comentário.