Lab 4: Counting Words

Posted on October 28, 2021

Getting Started

Pull your repository (git pull). You should see two blank starter files in your repo: lab04/WordCount.hs and lab04/CollectPermutations.hs.

Counting Words

In this lab you will produce a sorted word count from input text. The input will be a text stream as in the rot13 program and you will output a list of words with their associated counts such that the words are sorted (String is in typeclass Ord and so that ordering is acceptable) and all in lower-case. E.g. the text:

A parser for things
Is a function from strings
To lists of pairs
Of things and strings

should output:

a 2
and 1
for 1
from 1
function 1
is 1
lists 1
of 2
pairs 1
parser 1
strings 2
things 2
to 1

Implement this as a program with the exact name WordCount.hs such that it can be run as ./WordCount like the rot13 program. It should be in the lab4 directory of your repository.

A word is defined as a contiguous sequence of letters with the quote character '\'' and the dash character '-'. That means "dog's" and "about-face" are each considered one word. Note that '\'':[] == "'".

So given the following on standard input:

Anything'''that-is-not45a1word##character-   is'inbetwe-en@`a$()wo.rd

Your code should produce:

a 2
anything'''that-is-not 1
character- 1
is'inbetwe-en 1
rd 1
wo 1
word 1

Collecting Permutations

Some words are permutations of each other: e.g. ["outs", "oust"] or ["portions", "positron"]. Write a second program which outputs the permutations in sorted order for those words which have multiple permutations—ignore words that appear once, or only appear as one permutation. E.g. given the text:

On the Outer Pyre steps the three purest usurpers swore to
their vile pursuers they were no prey, nor danger. The garden
route was worse from the original form they expected there as
the live volcano nearby often erupts with wide ranged lava on
the ground.

The program should output:

danger, garden, ranged
erupts, purest
form, from
live, vile
no, on
outer, route
prey, pyre
pursuers, usurpers
swore, worse
there, three

The words on a single line should be sorted as well as the lines themselves in the text output. The output should not contain any duplicate words. Use the definition of a word given above. This program should be implemented with the exact name CollectPermutations.hs in your lab4/ folder of your repository. It should take in text just like the rot13 program. Make sure CollectPermutations runs fast (less than a second). You cannot enumerate all permutations: it will become too slow too fast. There is a particular observation you might make about letters in permutations that will help you out considerably. Hint: does their order matter?

Standard Input

Unlike Lab 1, your executables will not take arguments. All input will arrive on standard input. But, unlike Lab 3 you should not read standard input line by line. Instead, read the entire input as one big string. The getContents function may be useful, also consult the rot13 exercise.

To actually run your program with input, you could just type standard input…

$ ./WordCount
hello world
are you there
why aren't you counting my words yet

…but you’ll discover that your program doesn’t do anything. Hit ctrl-d to indicate that you are done producing input. Then the program will count the words.

You can also pipe the output of one command to the standard input of your program:

$ echo "a a a b b" | ./WordCount
a 3
b 2

Or you can use a file as the standard input:

$ ./WordCount < my_file.txt

Line Breaks

You may have heard the word “newline” mentioned in class. It’s something so second-nature to your instructors that we may have neglected to explain it well.

Input like this:

line 1
line 2
line 3

Will become this string in Haskell:

"line 1\nline 2\nline 3"

The \n is a newline character. Even though it’s written \n, it’s actually a single character in the string. It indicates there should be a line break there. Having a character that says where to break lines is pretty important—it even has a whole Wikipedia article!

For this lab, you only need to know about newlines to realize that yes, your input is coming through correctly. And, you may decide you want to insert newlines into your output. For example, this prints a number followed by three blank lines:

putStr $ show n ++ "\n\n\n"

Data.Map

This lab is a bit different from the previous ones in that we don’t want you to build as much from scratch. Part of learning to program well is using tools that are available to you. In the case of this lab, you need to use the module Data.Map.Strict.

Something of type Map k v is a collection of key-value pairs, where the keys have type k and the values have type v. If you have programmed in another language, this should be familiar (e.g., dictionaries in Python). Maps differs from lists of pairs (i.e., [(k, v)]) in that they are unordered and maintain only one value for any given key. They are also implemented to make standard operations like lookups more efficient.

You should import this module qualified:

import qualified Data.Map.Strict as Map

since it contains functions that clash with Prelude function, like filter and map. To use a function f from this module, you have to refer to it as Map.f, .e.g, Map.empty is the empty map.

There is a large collection of functions for operating on Maps, but here are a couple that might be useful for this lab.

unionsWith is the only function that you are required to used in this lab. You need to use it in both problems. As far as evaluation goes, we will just check that it appears in your solution. That said, there is a natural way to use it that we will be expecting to see. We want you to think about this, examine the type of unionsWith and some examples. (Hint: You should create a big collection of simple Maps and then combine them using unionsWith. If you get stuck, skim the Wikipedia page on MapReduce…)

Useful Functions

Here are a couple other functions that might be useful.

Note. Haskell does have a permutations function, but you should not use it! It is computationally explosive (consider: how many permutations might a 15-letter word have?).

We also recommend taking a look at (>>>) . This is the binary operator for left-to-right composition in Control.Arrow. It is used to combine a sequence of operations in the order that they are presented. For example, if you want to perform a map followed by a filter followed by a sort, then you can write something like:

foo :: Ord b => (a -> b) -> (b -> Bool) -> [a] -> [b]
foo f g =
  map f    >>>
  filter g >>>
  sort

Since you will likely be performing a sequence of operations on the input string, this might be useful, and could potentially make your code very simple.

To Hand In

Your lab4 directory should have CollectPermutations.hs and WordCount.hs, each of which can be compiled to take in text input and output the desired text. If you choose to make a third file to share code between the two programs, e.g. WordSplitter.hs, then be sure to commit and push that as well.

Remember that your code must include a use of unionsWith in each file.

You do not need to write tests for this week’s lab.

Push your code to your repository by 4:59PM on Tuesday 11/2.

Grading Breakdown

5% WordCount compiles. (ghc WordCount.hs does not error.)
5% CollectPermutations compiles. (ghc CollectPermutations.hs does not error.)
10% Clean code.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Type definitions on all top-level functions.
- Understandable names. Naming is one of the two hard problems in computer science. Short names are fine for common uses: i for index, x:xs for generic array elements, n for generic number, c for counter, etc. Abbreviations are fine too, but the rule is a reader should know the meaning of the thing just by reading its name.
- Type definitions on all top-level functions.
- No commented-out code.
- Avoid too many long lines (>100 characters). A couple long lines are okay.
- Parentheses in the correct place: sqrt (5 * 2) not sqrt(5 * 2).
- Avoid unnecessary code, e.g. pattern match cases that could be combined, or deconstructing a value and then just putting it back together.
- No long, hard-to-follow, or deeply nested functions that should be broken up. Take advantage of where and let.
- Comments explaining any hard-to-understand logic, e.g. what does this helper function do?
- Tasteful use of whitespace for vertical alignment.
- η-reduction is not necessary, but can be tasteful.
40% WordCount works correctly.
- 30% Words counted correctly
- 5% Words lowercase
  - 5% Output sorted
40% CollectPermutations works correctly.
- 20% Correct permutation groups
- 5% Only permutations that occur at least twice
- 5% No duplicate words
- 5% Words lowercase
  - 5% Output sorted

Extra Credit

(≤ 25%) Write an extra program WordSegmenter.hs that uses /usr/share/dict/words or a SCOWL dictionary to segment input text. E.g. the input:

$ echo "itisappropriatetostartthestudyofmathematicallogicbyin
quiringwhatmathematicallogicis" | ./WordSegmenter

Should output a single solution:

it is appropriate to start the study of mathematical logic by
inquiring what mathematical logic is

Note the input will simply be one long line of only lowercase letters: no punctuation, no capitals, no whitespace.

You can start the extra credit program with a file:

import System.IO
import WordSegmenterLib

main = do
    handle   <- openFile "/usr/share/dict/words" ReadMode
    contents <- hGetContents handle
    hClose handle

and you’ll need to add in logic to take input as well as implement a file WordSegmenterLib.hs which will contain the module WordSegmenterLib that contains your segmentation logic.

If you a dictionary other than /usr/share/dict/words, be sure to submit (and credit) the dictionary.

Note to get full extra credit on the segmenter your solution will have to have some intelligence (i.e. not just a greedy algorithm) because word segmentation can be ambiguous. Your segmenter must also run reasonably fast (within ~10 seconds).