Some elementary text processing with Unix:
I downloaded the archive of the Usenet newsgroup comp.ai.alfie for
94-09-20. It consisted of 65 individual files, which I concatenated into
a single file with the command:
cat * >bigfile
tr command (translate)
tr 'aeiou' 'x' <bigfile >xfile
will create a file called xfile in which every occurrence of the
lowercase letters 'a', 'e', 'i', 'o', 'u' is replaced by the letter 'x'.
tr -c 'aeiou' 'x' <bigfile >yfile
will create a file called yfile in which every character other than the
letters 'a', 'e', 'i', 'o', 'u' (spaces and new-lines included) becomes
an 'x'.
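To see both forms at work without touching bigfile, here is a small
sketch on a made-up sample string. The sample text and the variable
names are only illustrative, and the commands assume a tr (such as GNU
or BSD tr) that pads a short second set with its last character.

```shell
# Hypothetical sample text standing in for the contents of bigfile.
sample='audio files'

# Every lowercase vowel becomes x.
vowels_replaced=$(printf '%s' "$sample" | tr 'aeiou' 'x')

# -c complements the first set: everything that is NOT a vowel becomes x.
# printf '%s' writes no trailing new-line, which would otherwise turn into x too.
others_replaced=$(printf '%s' "$sample" | tr -c 'aeiou' 'x')

printf '%s\n' "$vowels_replaced" "$others_replaced"
```

The first line printed is xxdxx fxlxs and the second is auxioxxixex.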
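tr -c 'A-Za-z' '\012' <bigfile >zfile
will create a file called zfile in which everything other than a letter
is converted to a new-line character ('\012' is the octal code for
new-line). Runs of spaces and punctuation therefore produce runs of
consecutive new-lines, that is, empty lines. To get a file with each
word on its own line, add the -s (squeeze) option, which collapses each
run of repeated new-lines into a single one: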
tr -cs 'A-Za-z' '\012' <bigfile >wfile
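The effect of -s is easiest to see on a short line. The following is a
minimal sketch with a made-up sample sentence (not taken from the
archive) that compares the two commands:

```shell
# Hypothetical sample line with punctuation and a double space.
sample='Hello, world!  Hello again.'

# Without -s, each non-letter becomes its own new-line,
# so the output contains empty lines.
unsqueezed_lines=$(printf '%s\n' "$sample" | tr -c 'A-Za-z' '\012' | wc -l)

# With -s, each run of new-lines is squeezed to one: one word per line.
squeezed=$(printf '%s\n' "$sample" | tr -cs 'A-Za-z' '\012')
squeezed_lines=$(printf '%s\n' "$squeezed" | wc -l)

echo "lines without -s: $unsqueezed_lines"
echo "lines with -s:    $squeezed_lines"
printf '%s\n' "$squeezed"
```

For this sample the first command yields 8 lines, several of them
empty, while the second yields exactly 4 lines, one per word.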
Let's convert all upper case to lower case:
tr 'A-Z' 'a-z' <wfile >lowerfile
Now we sort the file so that identical words end up on adjacent lines,
pipe it to uniq -c to count each distinct word, and sort the result
again, numerically and in descending order:
sort lowerfile|uniq -c|sort -nr >countlist
The file countlist will contain each distinct word of the original
file preceded by its number of occurrences, most frequent first.
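The intermediate files are handy for inspecting each step, but they are
not required. As a sketch, the same steps can be chained into a single
pipeline. The input sentence and the variable name counts below are
made up for illustration.

```shell
# Hypothetical input standing in for bigfile.
counts=$(
  printf 'The cat saw the dog. The dog ran.\n' |
    tr -cs 'A-Za-z' '\012' |   # one word per line
    tr 'A-Z' 'a-z' |           # fold upper case to lower case
    sort |                     # bring identical words together
    uniq -c |                  # prefix each distinct word with its count
    sort -nr                   # highest count first
)

printf '%s\n' "$counts"
```

Here the first line of output reports the word "the" with a count of 3,
followed by "dog" with a count of 2 and then the words that occur once.
To run this on the real data, replace the printf line with a redirect
from bigfile on the first tr command.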