Unix is Friend - Text Processing

    Instead of relying on a single application tied to a proprietary file format, Unix utilities operate on text streams. A text stream is not only a text file but also the input and output of a command. Unix comes with several handy text processing utilities that cooperate well through text streams, so you should save your documents in plain text formats whenever possible. Here we briefly introduce some of these utilities.

    Before we jump into these utilities, let’s look at regular expressions. Regular expressions are not standalone commands but a mini-language embedded in many Unix utilities and programming languages. They are compact search patterns for strings and save a lot of hand-written conditions and loops. You can still use these Unix utilities without knowing regular expressions, but the utilities become more powerful with them.

    Different Unix utilities support different dialects of regular expressions, which causes confusion. We suggest starting with Perl regular expressions, one of the most feature-rich dialects. Check perlrequick, perlretut, and perlre for more information. A simple way to practice Perl regular expressions on the command line is pcregrep, a utility bundled with the Perl Compatible Regular Expressions (PCRE) library.
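
    As a small illustration of how one pattern replaces hand-written conditions and loops, the sketch below keeps only the lines that look like ISO dates. It uses grep -E (POSIX extended regular expressions) rather than pcregrep, simply because grep is available almost everywhere; this particular pattern reads the same in the Perl dialect. The sample data is made up for illustration.

```shell
# Sample input (hypothetical data): one candidate string per line.
sample='2024-01-15
not a date
1999-12-31
2024-1-5'

# Keep only lines made of 4 digits, a dash, 2 digits, a dash, 2 digits.
# -E selects extended regular expressions; -x requires the whole line to match.
dates=$(printf '%s\n' "$sample" | grep -Ex '[0-9]{4}-[0-9]{2}-[0-9]{2}')

printf '%s\n' "$dates"
```

    The last sample line is rejected because its month and day have only one digit each.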

    Now, back to the text processing utilities. They are practically part of Unix, so installation is seldom needed. We won’t cover every text processing utility, only some common ones:

    • iconv
    • sed
    • tr
    • awk
    • split and csplit
    • head and tail
    • nl
    • wc
    • perl

    Using perl to mimic Unix utilities is another application of Perl. The advantage is that you don’t need to memorize the usage of many utilities; the drawback is that the equivalent perl command is usually longer. See perlrun for details. There are also books discussing Perl one-liners, such as Minimal Perl for Unix and Linux People (Manning) and Perl One-Liners (No Starch Press).

    iconv converts text from one character encoding to another. For example:

    $ iconv -f iso-8859-1 -t utf-8 < infile > outfile

    Perl comes with a utility called piconv, which behaves like iconv. It is handy on systems that have no iconv, such as Microsoft Windows.
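
    To see what such a conversion actually changes, here is a minimal sketch that builds a tiny ISO-8859-1 file and compares byte counts before and after conversion. The file names and the sample word are assumptions chosen for illustration.

```shell
# Build a small ISO-8859-1 file: "café" with é encoded as the single byte 0xE9.
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/latin1.txt"

# Convert to UTF-8, where é becomes the two bytes 0xC3 0xA9.
iconv -f iso-8859-1 -t utf-8 < "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"

# Compare byte counts: 5 bytes before (c, a, f, 0xE9, newline), 6 after.
before=$(wc -c < "$tmpdir/latin1.txt" | tr -d ' ')
after=$(wc -c < "$tmpdir/utf8.txt" | tr -d ' ')
echo "bytes before: $before, after: $after"

rm -rf "$tmpdir"
```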

    sed modifies text streams and prints the result. sed can be used with or without regular expressions. Normally, sed doesn’t alter your file but prints the result to standard output. A simple use of sed looks like this:

    $ sed -i.bak 's/pattern/text/' file01 file02 file03 ...

    Here, -i means in-place editing; each original file is saved as a backup with the .bak suffix (file01.bak and so on).

    You may use perl to mimic sed:

    $ perl -p -i.bak -e 's/pattern/text/;' file01 file02 file03 ...
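
    For comparison, the sketch below runs sed without -i, so the result goes to standard output and the file stays untouched. It also shows the difference the g flag makes. The file name and its contents are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'cat and cat\nno match here\n' > "$tmpdir/pets.txt"

# Without -i, sed leaves the file alone and writes the result to standard output.
# s/cat/dog/ replaces only the first match on each line.
first_only=$(sed 's/cat/dog/' "$tmpdir/pets.txt")

# The g flag replaces every match on each line.
all_matches=$(sed 's/cat/dog/g' "$tmpdir/pets.txt")

printf '%s\n---\n%s\n' "$first_only" "$all_matches"

# The original file is unchanged.
unchanged=$(cat "$tmpdir/pets.txt")
rm -rf "$tmpdir"
```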

    tr translates text character by character. tr doesn’t support regular expressions. To convert uppercase letters to lowercase letters with tr, do this:

    $ tr "[:upper:]" "[:lower:]" < file

    If you want to list all words in a file, use tr to replace every character other than letters and digits with a newline:

    $ tr -c "[:alnum:]" "\n" < file

    Again, you can substitute perl for tr:

    $ perl -pe 'tr/A-Z/a-z/;' file
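
    The two tr commands above combine naturally with sort and uniq into a word frequency count. This is only a sketch on made-up sample text; the -s option, which squeezes repeated newlines into one, is added here so that punctuation followed by a line break does not produce empty lines.

```shell
text='The cat saw the Cat.
The dog saw THE cat!'

# 1. tr -cs: turn every run of non-alphanumeric characters into one newline,
#    which leaves one word per line.
# 2. tr: fold uppercase to lowercase so "The" and "THE" count as the same word.
# 3. sort | uniq -c: count identical lines.
# 4. sort -rn: most frequent first.
counts=$(printf '%s\n' "$text" \
    | tr -cs '[:alnum:]' '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn)

printf '%s\n' "$counts"

# The most frequent word, with the count stripped off.
top_word=$(printf '%s\n' "$counts" | awk 'NR == 1 { print $2 }')
```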

    AWK is an interpreted programming language for data extraction and report generation. AWK is well suited to quick text processing one-liners. To list all users on the system with AWK, do this:

    $ awk -F':' '/^[^#]/ { print $1 }' /etc/passwd

    Many features of AWK have been absorbed into Perl. To use perl to mimic awk for the same task, do this:

    $ perl -a -F':' -nle 'next if /^#/; print $F[0];' /etc/passwd
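
    The "report generation" side of AWK shows up once a pattern, an action and an END block work together. Here is a minimal sketch on hypothetical colon-separated records (the name:department:hours layout is an assumption for this example only).

```shell
# Hypothetical colon-separated records: name:department:hours
records='alice:dev:32
bob:ops:40
carol:dev:28'

# -F':' sets the field separator. The pattern $2 == "dev" selects rows,
# the action accumulates the third field, and END prints the report.
dev_hours=$(printf '%s\n' "$records" \
    | awk -F':' '$2 == "dev" { total += $3 } END { print total }')

echo "dev hours: $dev_hours"
```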

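    split and csplit split one file into several files: split cuts by line count or size, while csplit cuts by context, that is, at lines matching a regular expression or at given line numbers. Since the behavior of split and csplit involves writing several output files, there is no easy way to mimic csplit with perl one-liners.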

    To use csplit to split a file, do this:

    $ csplit file /pattern/
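
    To make the difference between the two commands concrete, the sketch below splits the same small file both ways. The file contents and the chunk_ and part prefixes are assumptions chosen for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'intro\n# One\nalpha\n# Two\nbeta\n' > "$tmpdir/notes.txt"

# split cuts by size: pieces of at most 2 lines named chunk_aa, chunk_ab, ...
split -l 2 "$tmpdir/notes.txt" "$tmpdir/chunk_"
chunk_count=$(ls "$tmpdir"/chunk_* | wc -l | tr -d ' ')

# csplit cuts by context: everything before the first line matching the
# pattern goes to part00, the rest to part01. -s suppresses the byte counts
# and -f sets the output file prefix.
csplit -s -f "$tmpdir/part" "$tmpdir/notes.txt" '/^# Two/'
first_part=$(cat "$tmpdir/part00")
second_part=$(cat "$tmpdir/part01")

echo "split made $chunk_count chunks"
printf 'part01 is:\n%s\n' "$second_part"

rm -rf "$tmpdir"
```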

    head prints the first several lines of a file. Similarly, tail prints the last several lines of a file. When run without a line count, head and tail print 10 lines. To print the first 5 lines of a file, do this:

    $ head -n5 file

    It is also possible to mimic head and tail in perl, but the command is longer:

    $ perl -ne 'print if $. <= 5;' file
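
    head and tail can also be chained to pick out a range of lines. A minimal sketch on ten generated sample lines:

```shell
# Ten numbered lines as sample input.
lines=$(printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10)

first_two=$(printf '%s\n' "$lines" | head -n 2)
last_two=$(printf '%s\n' "$lines" | tail -n 2)

# Combine both to pick a range: lines 4 through 6 are the last 3 of the first 6.
middle=$(printf '%s\n' "$lines" | head -n 6 | tail -n 3)

printf '%s\n' "$middle"
```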

    nl numbers the lines of a file and prints each line number followed by the content of that line. It is convenient when you need line numbers.
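
    One detail worth knowing is that nl skips empty lines by default. The sketch below, on made-up three-line input, compares the default numbering with -ba (number all lines) and shows a compact output format using -w (number width) and -s (separator).

```shell
text='first

third'

# By default nl numbers only non-empty lines, so the last number is 2.
last_default=$(printf '%s\n' "$text" | nl | tail -n 1 | awk '{ print $1 }')

# -ba numbers every line, including the empty one, so the last number is 3.
last_all=$(printf '%s\n' "$text" | nl -ba | tail -n 1 | awk '{ print $1 }')

# -w1 shrinks the number column and -s sets the text placed after the number.
third_line=$(printf '%s\n' "$text" | nl -ba -w1 -s': ' | tail -n 1)

echo "default: $last_default, with -ba: $last_all"
echo "$third_line"
```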

    Here is a longer example combining csplit, nl and perl. We extract the line numbers of the titles in the file and split the file at those line numbers.

    $ csplit file $(nl -ba -nln file | perl -a -nle 'print "$F[0]" if /pattern/;' \
    | perl -ne ' chomp; push @a, $_; } END { print "@a";')
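
    The same idea can be tried on a small sample file. The sketch below is an alternative formulation, not the command above: it gets the title line numbers from grep -n and cut instead of nl and perl, and it assumes titles are lines starting with "# ". The file contents and the sec prefix are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'preface\n# One\nalpha\n# Two\nbeta\ngamma\n' > "$tmpdir/book.txt"

# Line numbers of the title lines (assumed here: lines starting with "# ").
# grep -n prefixes each match with "NUMBER:", and cut keeps only the number.
title_lines=$(grep -n '^# ' "$tmpdir/book.txt" | cut -d: -f1)

# $title_lines is unquoted on purpose: the shell splits it so that each
# line number reaches csplit as a separate argument.
csplit -s -f "$tmpdir/sec" "$tmpdir/book.txt" $title_lines

section_count=$(ls "$tmpdir"/sec* | wc -l | tr -d ' ')
second_section=$(cat "$tmpdir/sec01")

echo "title lines:" $title_lines
echo "sections written: $section_count"

rm -rf "$tmpdir"
```

    Each output file starts at a title line, except the first one, which holds whatever precedes the first title.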

    wc is convenient for basic statistics of files, such as character counts, word counts and line counts. Be aware of Unicode issues: files may contain multibyte characters, so byte counts and character counts can differ. An alternative program is uniwc, part of the Unicode::Tussle Perl module.
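
    A minimal sketch of the three basic counts, followed by the multibyte pitfall, on made-up sample text. The tr -d ' ' step only strips the padding that some wc implementations add around the number.

```shell
text='one two three
four five'

line_count=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
word_count=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
byte_count=$(printf '%s\n' "$text" | wc -c | tr -d ' ')

echo "lines=$line_count words=$word_count bytes=$byte_count"

# "é" in UTF-8 is two bytes, so wc -c reports 2 for one visible character.
# wc -m counts characters instead, but its result depends on the locale.
e_bytes=$(printf '\303\251' | wc -c | tr -d ' ')
echo "bytes in one é: $e_bytes"
```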

    There are more commands and usages among the text processing utilities of Unix, but we won’t dig deeper here. Consult the system manual or online resources if you are interested in this topic. Good luck.
