Barcodes

PA 4 — Due: Fri Oct 27 11pm CT

Recall the academic integrity policy: turn in your own work and cite all sources.

It’s time to do some coding! Haskell coding, of course, but also barcoding. I’ve always wanted to learn, so here’s my chance — and yours to further develop your data structure manipulation skills using the facilities in Haskell. The particular code we will code is Code 128. When finished, your program will:

We will migrate the user interface from bitstrings to images in a subsequent assignment.

Code 128

The following copies and adapts heavily from the Wikipedia article on the topic. (Now may be a good time to donate $1.28 or $12.8 or $128 to the cause.)

Code 128 barcodes represent ASCII strings. In this assignment, we will support all ASCII printable characters — ASCII (decimal) codes 32 to 126, which correspond to alphanumeric (0-9, A-Z, a-z) and other characters — but we will forgo ASCII control characters — ASCII codes 0 to 31, and 127. (Note: Our implementation will not manipulate the (Integer) ASCII values of Characters.)

Each barcode has several sections:

  1. Quiet zone
  2. Start symbol
  3. Encoded data symbols
  4. Checksum symbol
  5. Stop pattern (stop symbol + final bar)
  6. Quiet zone

A symbol consists of 3 black bars and 3 white spaces of varying widths. Each symbol begins with a bar and ends with a space. The 6 bars and spaces alternate.

The width of each bar and space is a multiple of a basic unit called a module. Each bar and space is 1 to 4 modules wide. Furthermore, each symbol has a fixed width: the total width of the 6 bars and spaces is 11 modules.

Compared to other symbols, the stop pattern has 7 bars and spaces with a total width of 13 modules.

Two quiet zones, comprising all spaces, appear to the left and right of all symbols. The minimum width of each quiet zone is 10 modules. Since we will not be working with images in this assignment, we will not represent quiet zones in our implementation.

The processes for encoding data and computing checksums are described next.

The Codes

There are 108 symbols: 103 data symbols labeled 0 through 102 (most of which map to ASCII values; the rest are for code-specific purposes), plus 5 start and stop symbols. In this table, non-ASCII values are delimited by square brackets.

The patterns column comprises binary string representations for each symbol. The widths column is an alternative representation, denoting lengths of the alternating runs of bars and spaces. Patterns and widths are interderivable.

Each symbol can be interpreted in three different code sets, called A, B, and C. We are not concerned with ASCII control code characters. Therefore, we will not support code set A in our implementation, nor the “[DEL]” character in code set B.

There are 3 start symbols that indicate the initial code set to use when decoding the data.

Most code set B values represent one (printable ASCII) character. In contrast, most code set C values represent two (digit) characters. Code set C enables more compression of ASCII strings with many digits (more on this below).

Unsupported Symbols

Code sets B and C each map some symbols to non-ASCII values:

We will not support “[Shift]” values (relevant only for code set A) nor “[FNC]” values.

For simplicity, neither will we support “[Code]” values in this assignment; the initial code set will be used to process all symbols. (We will compute optimal encodings by mixing and matching B and C code sets in the future.)

Lastly, the “[Reverse Stop]” symbol indicates to a physical scanner that a barcode should be decoded in reverse. For simplicity, we will not support reverse decoding. (Besides, we are not living in a material world in this assignment.)

Checksums

The checksum symbol is defined by a weighted modulo-103 calculation. As an example, consider the string “Haskell!” that, when using code set B, results in a checksum of 1 as follows:

B Value Id Position = Id * Position
Start B 104 1 104
H 40 1 40
a 65 2 130
s 83 3 249
k 75 4 300
e 69 5 345
l 76 6 456
l 76 7 532
! 1 8 8
Sum: 2164
Mod 103: 1

The id of each encoded character is multiplied by its position in the barcode; the exception to this rule is that the start symbol and the first encoded character are both weighted as 1. The checksum is then defined to be the weighted sum modulo 103. This is the calculation process regardless of which code sets are used in the data encoding. (For computing checksums, “[Code]” and “[Shift]” values and their positions are treated the same as encoded characters.)

Data Representation

Half the battle for a programming task like this is understanding the problem domain and modeling its data. That work is largely done for you.

Zeroth, we pared down the Wikipedia article into the description above.

First, we simplified and reformatted the tables into the one shown above, and then saved it as a CSV file in the starter code. We “cleaned” the minor inconsistencies that arose among the different formats and software applications involved.

Furthermore, the starter code defines a series of datatypes and type synonyms to model the data and direct their manipulation. As we describe each component in turn, notice how Haskell type definitions, even on their own, inform how different data transformations will (need to) be implemented.

Barcode Strings

The user-facing representation of barcodes in this assignment are bitstrings. We will represent each bit as a Boolean (also called a Module), and a barcode string BCString as a list of Symbols, where each Symbol comprises 11 Modules. Valid barcode strings always have a final bar, so they are not explicitly represented as data in a BCString. (The stop symbol is always the same, too, but we have opted to keep it for no particularly good reason.)

data BCString =          -- "Barcode String"
  BCString [Symbol]      --   Start, data, checksum, and stop symbols.
                         --   The final bar ("11") is implicit.
                         --   No quiet zones.

type Symbol = [Module]   -- Always length 11
type Module = Bool       -- True/False means part of bar/space

Internal Barcode Representation

While BCStrings will be used to get and show barcodes to the user, most of our work will involve a more “internal” representation, called BC. Compared to BCStrings, this representation (a) structures the data into start symbol, encoded data, checksum symbol, and stop symbol (again, final bar is implicit), and (b) identifies Symbols by their SymbolIds (i.e. the Ints defined in the table above) for readability.

type BC =                -- "Barcode" (internal representation)
  ( SymbolId             --   Start symbol
  , [SymbolId]           --   Encoded data
  , SymbolId             --   Checksum
  , SymbolId             --   Stop symbol
  )                      --   Final bar is implicit

type SymbolId = Int

(If we were building this library for real, we would actually make these types internal by controlling what our module exports.)

Lookup Tables

The table above has several columns, the values of which will serve as “keys” to index corresponding values from other columns. We could represent the entire table as a list of rows, for example, of type [[String]]. Various lookup operations would then take O(mn) time, where m is the number of rows and n is the number columns. This may be fine for our little table and small data strings, however we will use use the opportunity to structure the table into a more efficient representation.

Another approach is to create an association list of type [(k, v)] for each pair of columns we need to map (with some key type k and value type v). For example, a mapping from symbol ids to patterns would have type [(SymbolId, Symbol)]. Lookup operations, implemented via lookup, would take O(m) time. Again, this is probably fine in our case, but let’s use a more scalable representation anyway.

The Data.Map library — actually implemented in Data.Map.Lazy — provides an API for finite maps (a.k.a. association lists a.k.a. dictionaries a.k.a. lookup tables) with logarithmic running times.

There are a variety of balanced search trees that can be used to support efficient map operations. And it turns out that lazy evaluation plays an important role in these implementations, given the emphasis in pure functional languages on copying immutable data (as opposed to mutating existing data). If you are interested in a taste, go back in time and check out CMSC 22300 (circa 2020) and Chris Okasaki’s classic Purely Functional Data Structures.

For our simple purposes, here are the relevant parts of the API:

empty  :: Map k a
insert :: Ord k => k -> a -> Map k a -> Map k a
lookup :: Ord k => k -> Map k a -> Maybe a
(!)    :: Ord k => Map k a -> k -> a             -- (!) is "unsafe" lookup

When the program first starts, we will read in the code128.csv file and construct the following big record of all the information we may need during encoding and decoding:

data TheCodes =
  TheCodes
    { startB       :: SymbolId
    , startC       :: SymbolId
    , stop         :: SymbolId
    , idsToSymbols :: Map SymbolId Symbol
    , symbolsToIds :: Map Symbol SymbolId
    , bEncodings   :: Map Char SymbolId
    , cEncodings   :: Map (Char, Char) SymbolId
    , bDecodings   :: Map SymbolId BValue
    , cDecodings   :: Map SymbolId CValue
    } deriving (Show)

type BValue = Either Char String
type CValue = Either (Char, Char) String

There are several components to the TheCodes. First, startB, startC, and stop are the ids for the relevant symbols; use these rather than sprinkling magic numbers throughout your code. Second, idsToSymbols and symbolsToIds map symbol ids to symbols and vice versa. Next, bEncodings maps (printable) ASCII Characters to their symbolic encodings, and bDecodings goes the other way. In the reverse direction, BValue represents either a printable Character or some other String (ASCII control codes, or code-specific values such as “[Code X]”). Lastly, cEncodings and cDecodings are similar, except that now (printable) ASCII characters come in pairs.

Errors

Encoding and decoding can fail for a variety of reasons. We will represent errors simply as strings (a more robust implementation might provide more structured error values).

type Error = String

Operations that, upon success, produce values of type some type a will actually produce values of type

Either Error a

which is inhabited by values of the form Left errorMsg where errorMsg :: String and values of the form Right result where result :: a. (See Prelude and Data.Either for more info.)

Sample interactions below show a variety of error messages. For testing and grading purposes, there are no requirements on the particular error messages; we will simply check that Left is correctly produced in error cases; the string it carries will be disregarded.

223 Tasks

The starter code contains several files:

Do not change the type signature of any function that is provided.

1. Data Loading

Read the code128.csv file and load its data into a value of type TheCodes.

loadTheCodes :: IO TheCodes

For example:

> :load Code128.hs
> theCodes <- loadTheCodes
> startB theCodes
104
> (bEncodings theCodes) ! 'H'
40

Note: Storing our “database” as a separate file, rather than as data embedded directly into Haskell file, means that this operation — and those which depend on it, such as all encoding and decoding functions! — is an IO action and is type-checked as such.

When building TheCodes, you may assume there are no errors in the CSV file; it corresponds exactly to the HTML table above. Similar, when using TheCodes, you may perform unsafe lookups — that is, (!) instead of lookup.

Notice that the function…

rowsToCodes :: [[String]] -> TheCodes

… is the main workhorse, performing a foldr over rows from the CSV file, building up the six lookup tables one row at a time.

The skeleton requires you to implement several helper functions, which are intended for use in loadTheCodes:

splitOn :: Char -> String -> [String]

dropFirstAndLast :: [a] -> [a]

readSymbol :: String -> Symbol

readBValue :: String -> BValue

readCValue :: String -> CValue

Note: If you don’t find it helpful to implement the helper functions readSymbol, readBValue, or readCValue, you can leave any of these three as undefined.

2. Basic Encodings

Recall that we will not support switching between code sets in this assignment. So, our “basic” encodings either always use B or always use C. (The type BC means barcode, not code sets B and/or C.)

encodeB :: TheCodes -> String -> Either Error BC
encodeC :: TheCodes -> String -> Either Error BC

The following helper should be used in both:

computeChecksum :: SymbolId -> [SymbolId] -> Int

There are beautiful, readable, one-line implementations of computeChecksum — find one!

The following little helper can be used in encodeB:

isPrintable :: Char -> Bool

Example B-Encodings

A few examples:

> :load Code128.hs
> theCodes <- loadTheCodes

> encodeB theCodes "Haskell!"
Right (104,[40,65,83,75,69,76,76,1],1,106)

> encodeB theCodes "CMSC 22300? HOT Programming in Haskell!"
Right (104,[35,45,51,35,0,18,18,19,16,16,31,0,40,47,52,0,48,82,79,71,82,65,77,77,73,78,71,0,73,78,0,40,65,83,75,69,76,76,1],72,106)

> encodeB theCodes "Separation of ⛪ and 🏛"
Left "encodeB: unsupported characters"

> encodeC theCodes "Haskell!"
Left "encodeC: unsupported characters"

Once you start iteratively developing and testing, you will be annoyed to run the loadTheCodes action every time you reload. So, the starter code provides helper functions testEncodeB and testEncodeC that can be used as follows:

> run encodeB "Haskell!"
Right (104,[40,65,83,75,69,76,76,1],1,106)

> run encodeC "Haskell!"
Left "encodeC: unsupported characters"

Pop Quiz: Before peeking in the repo, what is the type of run? Implement it!

Example C-Encodings

Strings comprising only digits can be encoded either with code set B or C:

> let piString = "3141592653"

> run encodeB piString
Right (104,[19,17,20,17,21,25,18,22,21,19],88,106)

> run encodeC piString
Right (105,[31,41,59,26,53],43,106)

> run encodeC "22300"
Left "encodeC: odd number of characters"

Implement the following two functions, and use them in encodeC if they would be helpful:

adjacentPairs :: [a] -> [(a, a)]

sequenceMaybes :: [Maybe a] -> Maybe [a]

The former should crash if given an odd number of elements. The latter has a strange name, but there will be a connection (eventually).

3. Basic Decodings

Like with encoding, we are supporting “basic” decodings that either always use B or always use C.

decodeB :: TheCodes -> BC -> Either Error String

decodeC :: TheCodes -> BC -> Either Error String

For example (the run function works with these decoders, too):

> run decodeB (104,[40,65,83,75,69,76,76,1],1,106)
Right "Haskell!"

> run decodeB (104,[19,17,20,17,21,25,18,22,21,19],88,106)
Right "3141592653"

> run decodeC (105,[31,41,59,26,53],43,106)
Right "3141592653"

> run decodeC (0, [], 0, 0)
Left "decodeC: bad start: 0"

4. Barcode String Manipulation

Equipped with the core operations for encoding and decoding barcodes (BCs), the last step is to connect them with the user interface, namely, bitstrings (BCStrings).

The implementation also provides two functions for going between the external (string) and internal representations; these are handy for the main interface and testing:

bcToBCString :: TheCodes -> BC -> BCString
bcStringToBC :: TheCodes -> BCString -> BC

Your tasks are to complete the Show and Read instances by implementing the following:

show :: BCString -> String

maybeReadBCString :: String -> Maybe BCString

Tests

The Tester.hs file sets up a little test runner:

% cabal run tests

PASS ...
PASS ...
FAIL ...
...

Note that this tests executable will crash as soon as any of your code does. So, to test your code incrementally, you can comment out lines of the form

runTests $ FUNCTION_TO_TEST theCodes

until you are ready to test

FUNCTION_TO_TEST ∈ {encodeB, encodeC, decodeB, decodeC, encodeAndShow, readAndDecode}

You are required to add at least 2 new test cases for each of these six functions (though you should probably add more to thoroughly test your implementations). See the bottom of Tester.hs.

Note: The tester is set up for functions of type (TheCodes -> a -> Either Error b), dealing with the plumbing to load TheCodes and propagate errors. If you would like to set up helpers for testing other required functions, which have simpler type signatures than these, go ahead. But please make sure not to change any existing names and type signatures.

Note: If you want to run

cabal run code128 -- encodeB STRING-WITH-SPECIAL-CHARACTERS

with an input string containing, for example, double-quotes or exclamation marks, you will need to use escape characters. See Tester.hs for an example.

Hints

You may or may not find the following helpful.

From Prelude:

The unsafe functions head, tail, and last are normally not recommended. But, we have some invariants about our lists, so we’ll use them.

From Data.Char:

Grading

Here is an outline of the grading rubric (10 points total):

Sanity Check

Once you are finished, check that you added, committed, and pushed everything to your repository before the deadline (or within an eligible late period):

https://github.com/UChicago-PL/cs223-fa23-pa-4-barcodes-USERNAME