Lab 4: Counting Words

Posted on October 30, 2017

Announcements

Back to our regular schedule. Today’s lab is due this week Thursday before 11pm (Nov 2).

Getting Started

Pull your repository (git pull). You should see two blank starter files in your repo: lab4/WordCount.hs and lab4/CollectPermutations.hs.

Counting Words

In this lab you will produce a sorted word count from input text. The input will be a text stream as in the rot13 program and you will output a list of words with their associated counts such that the words are sorted (String is in typeclass Ord and so that ordering is acceptable) and all in lower-case. E.g. the text:

A parser for things
Is a function from strings
To lists of pairs
Of things and strings

Should output:

a 2
and 1
for 1
from 1
function 1
is 1
lists 1
of 2
pairs 1
parser 1
strings 2
things 2
to 1

Implement this as a program with the exact name WordCount.hs such that it can be run as ./WordCount like the rot13 program. It should be in your lab4/ directory of your laboratory repository.

A word is defined as a contiguous sequence of letters with the quote character '\'' and the dash character '-'. That means "dog's" and "about-face" are each considered one word. Note that '\'':[] == "'".

So given the following on standard input:

Anything'''that-is-not45a1word##character-   is'inbetwe-en@`a$()wo.rd

Your code should produce:

a 2
anything'''that-is-not 1
character- 1
is'inbetwe-en 1
rd 1
wo 1
word 1

Collecting Permutations

Some words are permutations of each other: e.g. ["outs", "oust"] or ["portions", "positron"]. Write a second program which outputs the permutations in sorted order for those words which have multiple permutations—ignore words that appear once, or only appear as one permutation. E.g. given the text:

On the Outer Pyre steps the three purest usurpers swore to
their vile pursuers they were no prey, nor danger. The garden
route was worse from the original form they expected there as
the live volcano nearby often erupts with wide ranged lava on
the ground.

The program should output:

danger, garden, ranged
erupts, purest
form, from
live, vile
no, on
outer, route
prey, pyre
pursuers, usurpers
swore, worse
there, three

The words on a single line should be sorted as well as the lines themselves in the text output. The output should not contain any duplicate words. Use the definition of a word given above. This program should be implemented with the exact name CollectPermutations.hs in your lab4/ folder of your repository. It should take in text just like the rot13 program. Make sure CollectPermutations runs fast (less than a second). You cannot enumerate all permutations: it will become too slow too fast.

Standard Input

Unlike Lab 1, your executables will not take arguments. All input will arrive on standard input. But, unlike Lab 3 you should not read standard input line by line. Instead, read the entire input as one big string. The getContents function may be useful, also consult the rot13 exercise.

To actually run your program with input, you could just type standard input…

$ ./WordCount
hello world
are you there
why aren't you counting my words yet

…but you’ll discover that your program doesn’t do anything. Hit ctrl-d to indicate that you are done producing input. Then the program will count the words.

You can also pipe the output of one command to the standard input of your program:

$ echo "a a a b b" | ./WordCount
a 3
b 2

Or you can use a file as the standard input:

$ ./WordCount < my_file.txt

Line Breaks

You may have heard the word “newline” mentioned in class. It’s something so second-nature to your instructors that we may have neglected to explain it well.

Input like this:

line 1
line 2
line 3

Will become this string in Haskell:

"line 1\nline 2\nline 3"

The \n is a newline character. Even though it’s written \n, it’s actually a single character in the string. It indicates there should be a line break there. Having a character that says where to break lines is pretty important—it even has a whole Wikipedia article!

For this lab, you only need to know about newlines to realize that yes, your input is coming through correctly. And, you may decide you want to insert newlines into your output. For example, this prints a number followed by three blank lines:

putStr $ show n ++ "\n\n\n"

Useful Functions

To perform the sort you may want to use the Ord typeclass:

instance Ord  ....  where
   ... <= ... = ...

And here are some library functions that appear in your instructor’s solution.

Haskell does have a permutations function, however you should not use it because it is computationally explosive (consider: how many permutations might a 15-letter word have?).

To Hand In

Your lab4/ directory should have CollectPermutations.hs and WordCount.hs, each of which can be compiled to take in text input and output the desired text.

You do not need to write tests for this week’s lab.

Use Git to submit. As always, check Gitlab to confirm that your changes are there.

Grading Breakdown

5% WordCount compiles. (ghc WordCount.hs does not error.)
5% CollectPermutations compiles. (ghc CollectPermutations.hs does not error.)
10% Clean code.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Type definitions on all top-level functions.
- Understandable names. Naming is one of the two hard problems in computer science. Short names are fine for common uses: i for index, x:xs for generic array elements, n for generic number, c for counter, etc. Abbreviations are fine too, but the rule is a reader should know the meaning of the thing just by reading its name.
- Type definitions on all top-level functions.
- No commented-out code.
- Avoid too many long lines (>100 characters). A couple long lines are okay.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Avoid unnecessary code, e.g. pattern match cases that could be combined, or deconstructing a value and then just putting it back together.
- No long, hard-to-follow, or deeply nested functions that should be broken up. Take advantage of where and let.
- Comments explaining any hard-to-understand logic, e.g. what does this helper function do?
- Tasteful use of whitespace for vertical alignment.
- η-reduction is not necessary, but can be tasteful.
40% WordCount works correctly.
- 30% Words counted correctly
- 5% Words lowercase
- 5% Output sorted
40% CollectPermutations works correctly.
- 20% Correct permutation groups
- 5% Only permutations that occur at least twice
- 5% No duplicate words
- 5% Words lowercase
- 5% Output sorted

Extra Credit

Two extra credit assignments:

(15%) Use a trie as the internal data structure for the word counts. Do not turn in an extra solution, simply use a trie in WordCount.hs. (If you first build the word counter without a trie, be sure to commit before attempting with a trie so you can revert back if needed.)

(≤ 30%) Write an extra program WordSegmenter.hs that uses /usr/share/dict/words or a SCOWL dictionary to segment input text. E.g. the input:

$ echo "itisappropriatetostartthestudyofmathematicallogicbyin
quiringwhatmathematicallogicis" | ./WordSegmenter

Should output a single solution:

it is appropriate to start the study of mathematical logic by
inquiring what mathematical logic is

Note the input will simply be one long line of only lowercase letters: no punctuation, no capitals, no whitespace.

You can start the second extra credit program with a file:

import System.IO
import WordSegmenterLib

main = do
    handle   <- openFile "/usr/share/dict/words" ReadMode
    contents <- hGetContents handle
    hClose handle

and you’ll need to add in logic to take input as well as implement a file WordSegmenterLib.hs which will contain the module WordSegmenterLib that contains your segmentation logic.

If you a dictionary other than /usr/share/dict/words, be sure to submit (and credit) the dictionary.

Note to get full extra credit on the segmenter your solution will have to have some intelligence (i.e. not just a greedy algorithm) because word segmentation can be ambiguous.