Arquivo da tag: emails

Dump email addresses from files

ATENÇÃO: Este conteúdo foi publicado há 5 anos. Eu talvez nem concorde mais com ele. Se é um post sobre tecnologia, talvez não faça mais sentido. Mantenho neste blog o que escrevo desde os 14 anos por motivos históricos. Leia levando isso em conta.

Suppose you have a lot of .doc, .docx, .xls, .xlsx, .gz, .bz2, .pdf and text in general (.csv, .txt etc.) files and want to dump all the (unique) email addresses from them. How would you do it? Here is a simple solution I’ve just implemented (and probably didn’t test enough, so tell me if you find any bug):

#!/bin/sh
tmp=$(tempfile)
while [ $# -gt 0 ]; do
    if [ -r "$1" ]; then
        ext=$(echo ${1#*.} | tr [A-Z] [a-z])
        case $ext in
            docx | xlsx)
                # requires: http://blog.kiddaland.net/2009/07/antiword-for-office-2007/
                cat_open_xml "$1" >> $tmp
                ;;
            doc)
                # requires: antiword
                antiword "$1" >> $tmp
                ;;
            xls)
                # requires: catdoc
                xls2csv "$1" >> $tmp
                ;;
            gz)
                cat "$1" | gunzip >> $tmp
                ;;
            bz2)
                cat "$1" | bunzip2 >> $tmp
                ;;
            zip)
                unzip -p "$1" >> $tmp
                ;;
            pdf)
                # requires: xpdf-utils
                t=$(tempfile)
                pdftotext "$1" $t
                cat $t >> $tmp
                rm $t
                ;;
            *)
                text=$(file -b --mime-type "$1" | sed -e 's//.*//')
                if [ "z$text" = "ztext" ]; then
                    cat "$1" >> $tmp
                fi
                ;;
        esac
    fi
    shift
done
cat $tmp | grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,4}\b' 
         | tr [A-Z] [a-z] | sort -u
rm $tmp

(the email regexp is explained here: regular-expressions.info/email.html)