Instead of a single application for a proprietary file format, Unix utilities manage text streams. A text stream means not only a text file but also command-line input and output. Unix comes with several handy text processing utilities, and these tools cooperate well with text streams. Therefore, you should save your documents in plain text formats whenever possible. Here we briefly introduce some of them.
Before we jump into these utilities, let’s look at regular expressions. Regular expressions are not standalone command-line utilities but a mini-language embedded in many Unix utilities and programming languages. Regular expressions are compact search patterns for strings, saving you a lot of conditions and loops. If you do not know regular expressions, you can still use these Unix utilities, but they become more powerful with regular expressions.
There are several dialects of regular expressions in different Unix utilities, which causes confusion. We suggest starting with the regular expressions of Perl, the most complete dialect. Check perlrequick, perlretut, and perlre for more information. A simple way to practice and use Perl-style regular expressions on the command line is pcregrep, a utility bundled with the Perl Compatible Regular Expressions (PCRE) library.
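For example, the following searches a file with a Perl-style pattern; the date-like pattern and the file name are only illustrations:
$ pcregrep '\d{4}-\d{2}-\d{2}' file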
Now, let’s get back to those text processing utilities. These utilities are effectively part of Unix; installation is seldom needed. We won’t cover all text processing utilities here, only some common ones. They are:
iconv
sed
tr
awk
split and csplit
head and tail
nl
wc
perl
Using perl to mimic Unix utilities is another application of Perl. The advantage is that you don’t need to memorize the usages of many utilities, but the equivalent perl command is usually longer. See perlrun for details. There are also books discussing Perl one-liners, like Minimal Perl for Unix and Linux People (Manning) and Perl One-Liners (No Starch Press).
iconv converts text files from one character encoding to another. For example:
$ iconv -f iso-8859-1 -t utf-8 < infile > outfile
Perl comes with a utility called piconv, which behaves like iconv. It is handy for systems that have no iconv, like Microsoft Windows.
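A minimal piconv call mirrors the iconv example above, using the same -f and -t options:
$ piconv -f iso-8859-1 -t utf-8 < infile > outfile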
sed modifies text streams and prints out the result. sed can be used with or without regular expressions. Normally, sed doesn’t alter your file but prints the result to standard output. A simple usage of sed is like this:
$ sed -i.bak 's/pattern/text/' file01 file02 file03 ...
In this case, -i means in-place editing; the original files will be saved as file01.bak, etc.
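Without -i, sed leaves the files untouched and writes the result to standard output, which you may redirect yourself; file01.new here is just a placeholder:
$ sed 's/pattern/text/' file01 > file01.new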
You may use perl to mimic sed:
$ perl -p -i.bak -e 's/pattern/text/;' file01 file02 file03 ...
tr replaces text character by character. tr doesn’t accept regular expressions. To use tr to convert uppercase letters to lowercase letters, do this:
$ tr "[:upper:]" "[:lower:]" < file
If you want to list all words in a file, use tr to replace any characters other than alphanumeric characters:
$ tr -c "[:alnum:]" "\n" < file
Again, you can substitute perl for tr:
$ perl -pe 'tr/A-Z/a-z/;' file
AWK is an interpreted programming language for data extraction and report generation. AWK is suitable for fast one-liner text processing. To list all users on the system with AWK, do this:
$ awk -F':' '/^[^#]/ { print $1 }' /etc/passwd
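awk also shines at quick reports; for example, this sketch sums the first column of a whitespace-separated file, assuming it holds numbers:
$ awk '{ sum += $1 } END { print sum }' file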
Many features of AWK have been absorbed into Perl. To use perl to mimic awk in the same task, do this:
$ perl -a -F':' -nle 'next if /^#/; print $F[0];' /etc/passwd
split and csplit break one file into several files: split by size or line count, and csplit by regular expressions or line numbers. Since the behavior of split and csplit involves file I/O, there is no easy way to mimic csplit with perl one-liners.
To use csplit to split a file, do this:
$ csplit file /pattern/
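csplit also accepts plain line numbers; this splits file before lines 10 and 20 (the numbers are placeholders):
$ csplit file 10 20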
head prints out the first several lines of a file. Similarly, tail prints out the last several lines of a file. If run without arguments, head and tail print out 10 lines of a file. To print the first 5 lines of a file, do this:
$ head -n5 file
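The tail counterpart prints the last 5 lines:
$ tail -n5 file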
It is also possible to mimic head and tail in perl, but the commands are longer.
$ perl -ne 'print if $. <= 5;' file
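Mimicking tail takes a bit more work; one sketch keeps a sliding buffer of the last 5 lines and prints it at the end:
$ perl -ne 'push @l, $_; shift @l if @l > 5; END { print @l }' file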
nl numbers the lines in a file and prints out the line numbers along with the contents of the file. It is convenient if you need the line numbers.
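For example, -ba numbers every line, including blank ones:
$ nl -ba file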
Here is a longer example combining csplit, nl, and perl. We extract the line numbers of the titles in the file and then split the file at those line numbers:
$ csplit file $(nl -ba -nln file | perl -anle 'print $F[0] if /pattern/;' \
| perl -ne 'chomp; push @a, $_; END { print "@a"; }')
wc is convenient for some basic statistics of files, like character counts, word counts, and line counts. Be aware of Unicode issues; there may be multibyte characters in files. An alternative program is uniwc, a part of the Unicode::Tussle Perl module.
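For example, this prints the line, word, and character counts of a file; -m counts characters according to the current locale rather than bytes:
$ wc -l -w -m file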
There are more commands and usages among the text processing utilities of Unix, but we won’t dig too deeply here. Consult the system manual or online resources if you are interested in this topic. Good luck.