PA 4 — Due: Fri Oct 27 11pm CT
Recall the academic
integrity policy: turn in your own work and cite all sources.
It’s time to do some coding! Haskell coding, of course, but also barcoding. I’ve always wanted to learn, so here’s my chance — and yours to further develop your data structure manipulation skills using the facilities in Haskell. The particular code we will code is Code 128. When finished, your program will:
Encode strings as barcode bitstrings. For example:
% cabal run code128 -- encodeB Wikipedia
11010010000111010001101000011010011000010010100001101001010011110010110010000100001001101000011010010010110000111100100101100011101011
Decode barcode bitstrings. For example:
% cabal run code128 -- decodeB 11010010000111010001101000011010011000010010100001101001010011110010110010000100001001101000011010010010110000111100100101100011101011
Wikipedia
We will migrate the user interface from bitstrings to images in a subsequent assignment.
The following copies and adapts heavily from the Wikipedia article on the topic. (Now may be a good time to donate $1.28 or $12.8 or $128 to the cause.)
Code 128 barcodes represent ASCII strings. In this assignment, we
will support all ASCII
printable characters — ASCII (decimal) codes 32 to 126, which
correspond to alphanumeric (0-9, A-Z, a-z) and other characters — but we
will forgo ASCII
control characters — ASCII codes 0 to 31, and 127. (Note: Our implementation will not manipulate the
(Int
eger) ASCII values of Char
acters.)
Each barcode has several sections:
A symbol consists of 3 black bars and 3 white spaces of varying widths. Each symbol begins with a bar and ends with a space. The 6 bars and spaces alternate.
The width of each bar and space is a multiple of a basic unit called a module. Each bar and space is 1 to 4 modules wide. Furthermore, each symbol has a fixed width: the total width of the 6 bars and spaces is 11 modules.
Compared to other symbols, the stop pattern has 7 bars and spaces with a total width of 13 modules.
Two quiet zones, comprising all spaces, appear to the left and right of all symbols. The minimum width of each quiet zone is 10 modules. Since we will not be working with images in this assignment, we will not represent quiet zones in our implementation.
The processes for encoding data and computing checksums are described next.
There are 108 symbols: 103 data symbols labeled 0 through 102 (most of which map to ASCII values; the rest are for code-specific purposes), plus 5 start and stop symbols. In this table, non-ASCII values are delimited by square brackets.
The patterns column comprises binary string representations for each symbol. The widths column is an alternative representation, denoting lengths of the alternating runs of bars and spaces. Patterns and widths are interderivable.
Each symbol can be interpreted in three different code sets, called A, B, and C. We are not concerned with ASCII control code characters. Therefore, we will not support code set A in our implementation, nor the “[DEL]” character in code set B.
There are 3 start symbols that indicate the initial code set to use when decoding the data.
Most code set B values represent one (printable ASCII) character. In contrast, most code set C values represent two (digit) characters. Code set C enables more compression of ASCII strings with many digits (more on this below).
Code sets B and C each map some symbols to non-ASCII values:
We will not support “[Shift]” values (relevant only for code set A) nor “[FNC]” values.
For simplicity, neither will we support “[Code]” values in this assignment; the initial code set will be used to process all symbols. (We will compute optimal encodings by mixing and matching B and C code sets in the future.)
Lastly, the “[Reverse Stop]” symbol indicates to a physical scanner that a barcode should be decoded in reverse. For simplicity, we will not support reverse decoding. (Besides, we are not living in a material world in this assignment.)
The checksum symbol is defined by a weighted modulo-103 calculation. As an example, consider the string “Haskell!” that, when using code set B, results in a checksum of 1 as follows:
B Value | Id | Position | = Id * Position |
---|---|---|---|
Start B | 104 | 1 | 104 |
H | 40 | 1 | 40 |
a | 65 | 2 | 130 |
s | 83 | 3 | 249 |
k | 75 | 4 | 300 |
e | 69 | 5 | 345 |
l | 76 | 6 | 456 |
l | 76 | 7 | 532 |
! | 1 | 8 | 8 |
Sum: | 2164 | ||
Mod 103: | 1 |
The id of each encoded character is multiplied by its position in the barcode; the exception to this rule is that the start symbol and the first encoded character are both weighted as 1. The checksum is then defined to be the weighted sum modulo 103. This is the calculation process regardless of which code sets are used in the data encoding. (For computing checksums, “[Code]” and “[Shift]” values and their positions are treated the same as encoded characters.)
Half the battle for a programming task like this is understanding the problem domain and modeling its data. That work is largely done for you.
Zeroth, we pared down the Wikipedia article into the description above.
First, we simplified and reformatted the tables into the one shown above, and then saved it as a CSV file in the starter code. We “cleaned” the minor inconsistencies that arose among the different formats and software applications involved.
Furthermore, the starter code defines a series of datatypes and type synonyms to model the data and direct their manipulation. As we describe each component in turn, notice how Haskell type definitions, even on their own, inform how different data transformations will (need to) be implemented.
The user-facing representation of barcodes in this assignment are
bitstrings. We will represent each bit as a Bool
ean (also
called a Module
), and a barcode string
BCString
as a list of Symbol
s, where each
Symbol
comprises 11 Module
s. Valid barcode
strings always have a final bar, so they are not explicitly represented
as data in a BCString
. (The stop symbol is always the same,
too, but we have opted to keep it for no particularly good reason.)
data BCString = -- "Barcode String"
BCString [Symbol] -- Start, data, checksum, and stop symbols.
-- The final bar ("11") is implicit.
-- No quiet zones.
type Symbol = [Module] -- Always length 11
type Module = Bool -- True/False means part of bar/space
While BCString
s will be used to get and show barcodes to
the user, most of our work will involve a more “internal”
representation, called BC
. Compared to
BCString
s, this representation (a) structures the data into
start symbol, encoded data, checksum symbol, and stop symbol (again,
final bar is implicit), and (b) identifies Symbol
s by their
SymbolId
s (i.e. the Int
s defined in the table above) for readability.
type BC = -- "Barcode" (internal representation)
( SymbolId -- Start symbol
, [SymbolId] -- Encoded data
, SymbolId -- Checksum
, SymbolId -- Stop symbol
) -- Final bar is implicit
type SymbolId = Int
(If we were building this library for real, we would actually make these types internal by controlling what our module exports.)
The table above has several
columns, the values of which will serve as “keys” to index corresponding
values from other columns. We could represent the entire table as a list
of rows, for example, of type [[String]]
. Various lookup
operations would then take O(mn) time, where
m is the number of rows and
n is the number columns. This
may be fine for our little table and small data strings, however we will
use use the opportunity to structure the table into a more efficient
representation.
Another approach is to create an association list of type
[(k, v)]
for each pair of columns we need to map (with some
key type k
and value type v
). For example, a
mapping from symbol ids to patterns would have type
[(SymbolId, Symbol)]
. Lookup operations, implemented via lookup
,
would take O(m) time.
Again, this is probably fine in our case, but let’s use a more scalable
representation anyway.
The Data.Map library — actually implemented in Data.Map.Lazy — provides an API for finite maps (a.k.a. association lists a.k.a. dictionaries a.k.a. lookup tables) with logarithmic running times.
There are a variety of balanced search trees that can be used to support efficient map operations. And it turns out that lazy evaluation plays an important role in these implementations, given the emphasis in pure functional languages on copying immutable data (as opposed to mutating existing data). If you are interested in a taste, go back in time and check out CMSC 22300 (circa 2020) and Chris Okasaki’s classic Purely Functional Data Structures.
For our simple purposes, here are the relevant parts of the API:
empty :: Map k a
insert :: Ord k => k -> a -> Map k a -> Map k a
lookup :: Ord k => k -> Map k a -> Maybe a
(!) :: Ord k => Map k a -> k -> a -- (!) is "unsafe" lookup
When the program first starts, we will read in the
code128.csv
file and construct the following big record of
all the information we may need during encoding and decoding:
data TheCodes =
TheCodes
{ startB :: SymbolId
, startC :: SymbolId
, stop :: SymbolId
, idsToSymbols :: Map SymbolId Symbol
, symbolsToIds :: Map Symbol SymbolId
, bEncodings :: Map Char SymbolId
, cEncodings :: Map (Char, Char) SymbolId
, bDecodings :: Map SymbolId BValue
, cDecodings :: Map SymbolId CValue
} deriving (Show)
type BValue = Either Char String
type CValue = Either (Char, Char) String
There are several components to the TheCodes
. First,
startB
, startC
, and stop
are the
ids for the relevant symbols; use these rather than sprinkling magic
numbers throughout your code. Second, idsToSymbols
and
symbolsToIds
map symbol ids to symbols and vice versa.
Next, bEncodings
maps (printable) ASCII
Char
acters to their symbolic encodings, and
bDecodings
goes the other way. In the reverse direction,
BValue
represents either a printable Char
acter
or some other String
(ASCII control codes, or code-specific
values such as “[Code X]”). Lastly, cEncodings
and
cDecodings
are similar, except that now (printable) ASCII
characters come in pairs.
Encoding and decoding can fail for a variety of reasons. We will represent errors simply as strings (a more robust implementation might provide more structured error values).
type Error = String
Operations that, upon success, produce values of type some type
a
will actually produce values of type
Either Error a
which is inhabited by values of the form Left errorMsg
where errorMsg :: String
and values of the form
Right result
where result :: a
. (See Prelude
and Data.Either
for more info.)
Sample interactions below show a variety of error messages. For
testing and grading purposes, there are no requirements on the
particular error messages; we will simply check that Left
is correctly produced in error cases; the string it carries will be
disregarded.
The starter code contains several files:
code128.csv
: Same as the table above. Don’t change this
file.
Code128.hs
: The functions to implement are organized
into several groups:
Note: No, there aren’t 223
undefined
s to fill in. Only ~20… But many are modest helper
functions or fill-in-the blanks, in service of the primary functions —
{en,de}code{B,C}
for encoding and decoding, and several
fiddly string manipulation functions.
Tester.hs
:
Tester.hs
(discussed below).code128.cabal
: This sets up two executables:
cabal run code128
cabal run tests
You may tweak the .cabal
file with additional packages
and language extensions, if needed.
cabal
issues sooner rather than later.
Do not change the type signature of any function that is provided.
Read the code128.csv
file and load its data into a value
of type TheCodes
.
loadTheCodes :: IO TheCodes
For example:
> :load Code128.hs
> theCodes <- loadTheCodes
> startB theCodes
104
> (bEncodings theCodes) ! 'H'
40
Note: Storing our “database” as a separate
file, rather than as data embedded directly into Haskell file, means
that this operation — and those which depend on it, such as all encoding
and decoding functions! — is an IO
action and is
type-checked as such.
When building TheCodes
, you may assume there are no
errors in the CSV file; it corresponds exactly to the HTML table above. Similar, when using
TheCodes
, you may perform unsafe lookups — that is,
(!)
instead of lookup
.
Notice that the function…
rowsToCodes :: [[String]] -> TheCodes
… is the main workhorse, performing a foldr
over rows
from the CSV file, building up the six lookup tables one row at a
time.
The skeleton requires you to implement several helper functions,
which are intended for use in loadTheCodes
:
splitOn :: Char -> String -> [String]
dropFirstAndLast :: [a] -> [a]
readSymbol :: String -> Symbol
readBValue :: String -> BValue
readCValue :: String -> CValue
Note: If you don’t find it helpful to
implement the helper functions readSymbol
,
readBValue
, or readCValue
, you can leave any
of these three as undefined
.
Recall that we will not support switching between code sets in this
assignment. So, our “basic” encodings either always use B or always use
C. (The type BC
means barcode, not code sets B and/or
C.)
encodeB :: TheCodes -> String -> Either Error BC
encodeC :: TheCodes -> String -> Either Error BC
The following helper should be used in both:
computeChecksum :: SymbolId -> [SymbolId] -> Int
There are beautiful, readable, one-line implementations of
computeChecksum
— find one!
The following little helper can be used in encodeB
:
isPrintable :: Char -> Bool
A few examples:
> :load Code128.hs
> theCodes <- loadTheCodes
> encodeB theCodes "Haskell!"
Right (104,[40,65,83,75,69,76,76,1],1,106)
> encodeB theCodes "CMSC 22300? HOT Programming in Haskell!"
Right (104,[35,45,51,35,0,18,18,19,16,16,31,0,40,47,52,0,48,82,79,71,82,65,77,77,73,78,71,0,73,78,0,40,65,83,75,69,76,76,1],72,106)
> encodeB theCodes "Separation of ⛪ and 🏛"
Left "encodeB: unsupported characters"
> encodeC theCodes "Haskell!"
Left "encodeC: unsupported characters"
Once you start iteratively developing and testing, you will be
annoyed to run the loadTheCodes
action every time you
reload. So, the starter code provides helper functions
testEncodeB
and testEncodeC
that can be used
as follows:
> run encodeB "Haskell!"
Right (104,[40,65,83,75,69,76,76,1],1,106)
> run encodeC "Haskell!"
Left "encodeC: unsupported characters"
Pop Quiz: Before peeking in the repo, what
is the type of run
? Implement it!
Strings comprising only digits can be encoded either with code set B or C:
> let piString = "3141592653"
> run encodeB piString
Right (104,[19,17,20,17,21,25,18,22,21,19],88,106)
> run encodeC piString
Right (105,[31,41,59,26,53],43,106)
> run encodeC "22300"
Left "encodeC: odd number of characters"
Implement the following two functions, and use them in
encodeC
if they would be helpful:
adjacentPairs :: [a] -> [(a, a)]
sequenceMaybes :: [Maybe a] -> Maybe [a]
The former should crash if given an odd number of elements. The latter has a strange name, but there will be a connection (eventually).
Like with encoding, we are supporting “basic” decodings that either always use B or always use C.
decodeB :: TheCodes -> BC -> Either Error String
decodeC :: TheCodes -> BC -> Either Error String
For example (the run
function works with these decoders,
too):
> run decodeB (104,[40,65,83,75,69,76,76,1],1,106)
Right "Haskell!"
> run decodeB (104,[19,17,20,17,21,25,18,22,21,19],88,106)
Right "3141592653"
> run decodeC (105,[31,41,59,26,53],43,106)
Right "3141592653"
> run decodeC (0, [], 0, 0)
Left "decodeC: bad start: 0"
Equipped with the core operations for encoding and decoding barcodes
(BC
s), the last step is to connect them with the user
interface, namely, bitstrings (BCString
s).
The implementation also provides two functions for going between the
external (string) and internal representations; these are handy for the
main
interface and testing:
bcToBCString :: TheCodes -> BC -> BCString
bcStringToBC :: TheCodes -> BCString -> BC
Your tasks are to complete the Show
and
Read
instances by implementing the following:
show :: BCString -> String
maybeReadBCString :: String -> Maybe BCString
The Tester.hs
file sets up a little test runner:
% cabal run tests
PASS ...
PASS ...
FAIL ...
...
Note that this tests
executable will crash as soon as
any of your code does. So, to test your code incrementally, you can
comment out lines of the form
runTests $ FUNCTION_TO_TEST theCodes
until you are ready to test
FUNCTION_TO_TEST
∈ {encodeB
, encodeC
, decodeB
, decodeC
, encodeAndShow
, readAndDecode
}
You are required to add at least 2 new test
cases for each of these six functions (though you should
probably add more to thoroughly test your implementations). See the
bottom of Tester.hs
.
Note: The tester is set up for functions of
type (TheCodes -> a -> Either Error b)
, dealing with
the plumbing to load TheCodes
and propagate errors. If you
would like to set up helpers for testing other required functions, which
have simpler type signatures than these, go ahead. But please make sure
not to change any existing names and type signatures.
Note: If you want to run
cabal run code128 -- encodeB STRING-WITH-SPECIAL-CHARACTERS
with an input string containing, for example, double-quotes or
exclamation marks, you will need to use escape characters. See
Tester.hs
for an example.
You may or may not find the following helpful.
From Prelude:
head :: [a] -> a
tail :: [a] -> [a]
last :: [a] -> a
uncons :: [a] -> Maybe (a, [a])
reverse :: [a] -> [a]
length :: [a] -> Int
lines :: String -> [String]
unlines :: [String] -> String
zip :: [a] -> [b] -> [(a, b)]
splitAt :: Int -> [a] -> ([a], [a])
not :: Bool -> Bool
elem :: Eq a => a -> [a] -> Bool
all :: (a -> Bool) -> [a] -> Bool
any :: (a -> Bool) -> [a] -> Bool
concat :: [[a]] -> [a]
concatMap :: (a -> [b]) -> [a] -> [b]
(++) :: [a] -> [a] -> [a]
(<$>), fmap :: Functor t => (a -> b) -> t a -> t b
The unsafe functions head
, tail
, and
last
are normally not recommended. But, we have some
invariants about our lists, so we’ll use them.
From Data.Char:
ord :: Char -> Int
chr :: Int -> Char
Here is an outline of the grading rubric (10 points total):
1.0
Data Loading
0.1
splitOn
0.1
dropFirstAndLast
0.1
readSymbol
0.1
readBValue
0.1
readCValue
0.5
loadTheCodes
3.0
Encoding
0.5
computeChecksum
0.1
isPrintable
0.2
adjacentPairs
0.2
sequenceMaybes
1.0
encodeB
1.0
encodeC
2.0
Decoding
1.0
decodeB
1.0
decodeC
2.0
Barcode String Manipulation
1.0
show
1.0
maybeReadBCString
1.0
Adding Test Cases
{en,de}code{B,C}
,
encodeAndShow
, readAndDecode
1.0
Clean (Haskell) code
Once you are finished, check that you add
ed,
commit
ted, and push
ed everything to your
repository before the deadline (or within an eligible late period):
https://github.com/UChicago-PL/cs223-fa23-pa-4-barcodes-USERNAME