Lab 4: Counting Words

Posted on November 4, 2019

Getting Started

Test.Pull your repository (git pull). You should see two blank starter files in your repo: lab04/WordCount.hs and lab04/CollectPermutations.hs.

Counting Words

In this lab you will produce a sorted word count from input text. The input will be a text stream as in the rot13 program and you will output a list of words with their associated counts such that the words are sorted (String is in typeclass Ord and so that ordering is acceptable) and all in lower-case. E.g. the text:

A parser for things
Is a function from strings
To lists of pairs
Of things and strings

Should output:

a 2
and 1
for 1
from 1
function 1
is 1
lists 1
of 2
pairs 1
parser 1
strings 2
things 2
to 1

Implement this as a program with the exact name WordCount.hs such that it can be run as ./WordCount like the rot13 program. It should be in your lab4/ directory of your laboratory repository.

A word is defined as a contiguous sequence of letters with the quote character '\'' and the dash character '-'. That means "dog's" and "about-face" are each considered one word. Note that '\'':[] == "'".

So given the following on standard input:

Anything'''that-is-not45a1word##character-   is'inbetwe-en@`a$()wo.rd

Your code should produce:

a 2
anything'''that-is-not 1
character- 1
is'inbetwe-en 1
rd 1
wo 1
word 1

Collecting Permutations

Some words are permutations of each other: e.g. ["outs", "oust"] or ["portions", "positron"]. Write a second program which outputs the permutations in sorted order for those words which have multiple permutations—ignore words that appear once, or only appear as one permutation. E.g. given the text:

On the Outer Pyre steps the three purest usurpers swore to
their vile pursuers they were no prey, nor danger. The garden
route was worse from the original form they expected there as
the live volcano nearby often erupts with wide ranged lava on
the ground.

The program should output:

danger, garden, ranged
erupts, purest
form, from
live, vile
no, on
outer, route
prey, pyre
pursuers, usurpers
swore, worse
there, three

The words on a single line should be sorted as well as the lines themselves in the text output. The output should not contain any duplicate words. Use the definition of a word given above. This program should be implemented with the exact name CollectPermutations.hs in your lab4/ folder of your repository. It should take in text just like the rot13 program. Make sure CollectPermutations runs fast (less than a second). You cannot enumerate all permutations: it will become too slow too fast. There is a particular observation you might make about letters in permutations that will help you out considerably. Hint: does their order matter?

Standard Input

Unlike Lab 1, your executables will not take arguments. All input will arrive on standard input. But, unlike Lab 3 you should not read standard input line by line. Instead, read the entire input as one big string. The getContents function may be useful, also consult the rot13 exercise.

To actually run your program with input, you could just type standard input…

$ ./WordCount
hello world
are you there
why aren't you counting my words yet

…but you’ll discover that your program doesn’t do anything. Hit ctrl-d to indicate that you are done producing input. Then the program will count the words.

You can also pipe the output of one command to the standard input of your program:

$ echo "a a a b b" | ./WordCount
a 3
b 2

Or you can use a file as the standard input:

$ ./WordCount < my_file.txt

Line Breaks

You may have heard the word “newline” mentioned in class. It’s something so second-nature to your instructors that we may have neglected to explain it well.

Input like this:

line 1
line 2
line 3

Will become this string in Haskell:

"line 1\nline 2\nline 3"

The \n is a newline character. Even though it’s written \n, it’s actually a single character in the string. It indicates there should be a line break there. Having a character that says where to break lines is pretty important—it even has a whole Wikipedia article!

For this lab, you only need to know about newlines to realize that yes, your input is coming through correctly. And, you may decide you want to insert newlines into your output. For example, this prints a number followed by three blank lines:

putStr $ show n ++ "\n\n\n"

Using Library Functions

This lab is a bit different from the previous ones in that we do not want you to build very much from scratch. Part of learning to program well is using tools that are available to you. There are two primary tools that we will take advantage of.

Data.Map.Strict : The type Map k v is Haskell’s representation of collections of key-value pairs (keys of type k and values of type v) and are sometimes called dictionaries. You may have to import this module while hiding some functions, like import Data.Map.Strict hiding (filter, map). It differs for lists of pairs (i.e., [(k, v)]) in that it maintains that there is only one value for any given key and is implemented to make standard operations like lookups more efficient. There is also a large collection of useful functions for operating on Maps. Here are a couple functions that might be useful for this lab. Note that there is a function in this list that you are required to use.

singleton
toAscList
elems
unionsWith (REQUIRED) This is the only function that you are required to used in this lab. You need to use it in both problems. (hint. You should create a big collection of simple Maps and then combine them using unionsWith)

(>>>) : This is the binary operator for left-to-right composition in Control.Arrow. It is used to combine a sequence of operations in the order that they are presented. For example, if you want to perform a map followed by a filter followed by a sort, then you can write something like:

foo :: Ord b => (a -> b) -> (b -> Bool) -> [a] -> [b]
foo f g =
  map f    >>>
  filter g >>>
  sort

Since you will likely be performing a sequence of operations on the input string, this might be useful, and could potentially make your code very simple.

Here are a couple other functions that might be useful.

Note. Haskell does have a permutations function, however you should not use it because it is computationally explosive (consider: how many permutations might a 15-letter word have?).

To Hand In

Your lab4/ directory should have CollectPermutations.hs and WordCount.hs, each of which can be compiled to take in text input and output the desired text. If you choose to make a third file to share code between the two programs, e.g. WordSplitter.hs, then be sure to commit and push that as well.

Remember that your code must include a use of unionsWith in each file.

You do not need to write tests for this week’s lab.

Push your code to your repository by 11:59PM on Thursday 11/7.

Grading Breakdown

5% WordCount compiles. (ghc WordCount.hs does not error.)
5% CollectPermutations compiles. (ghc CollectPermutations.hs does not error.)
10% Clean code.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Type definitions on all top-level functions.
- Understandable names. Naming is one of the two hard problems in computer science. Short names are fine for common uses: i for index, x:xs for generic array elements, n for generic number, c for counter, etc. Abbreviations are fine too, but the rule is a reader should know the meaning of the thing just by reading its name.
- Type definitions on all top-level functions.
- No commented-out code.
- Avoid too many long lines (>100 characters). A couple long lines are okay.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Avoid unnecessary code, e.g. pattern match cases that could be combined, or deconstructing a value and then just putting it back together.
- No long, hard-to-follow, or deeply nested functions that should be broken up. Take advantage of where and let.
- Comments explaining any hard-to-understand logic, e.g. what does this helper function do?
- Tasteful use of whitespace for vertical alignment.
- η-reduction is not necessary, but can be tasteful.
40% WordCount works correctly.
- 30% Words counted correctly
- 5% Words lowercase
  - 5% Output sorted
40% CollectPermutations works correctly.
- 20% Correct permutation groups
- 5% Only permutations that occur at least twice
- 5% No duplicate words
- 5% Words lowercase
  - 5% Output sorted

Extra Credit

(≤ 25%) Write an extra program WordSegmenter.hs that uses /usr/share/dict/words or a SCOWL dictionary to segment input text. E.g. the input:

$ echo "itisappropriatetostartthestudyofmathematicallogicbyin
quiringwhatmathematicallogicis" | ./WordSegmenter

Should output a single solution:

it is appropriate to start the study of mathematical logic by
inquiring what mathematical logic is

Note the input will simply be one long line of only lowercase letters: no punctuation, no capitals, no whitespace.

You can start the second extra credit program with a file:

import System.IO
import WordSegmenterLib

main = do
    handle   <- openFile "/usr/share/dict/words" ReadMode
    contents <- hGetContents handle
    hClose handle

and you’ll need to add in logic to take input as well as implement a file WordSegmenterLib.hs which will contain the module WordSegmenterLib that contains your segmentation logic.

If you a dictionary other than /usr/share/dict/words, be sure to submit (and credit) the dictionary.

Note to get full extra credit on the segmenter your solution will have to have some intelligence (i.e. not just a greedy algorithm) because word segmentation can be ambiguous. Your segmenter must also run reasonably fast (within ~15 seconds).