==========================
Analyzing Candidate Tweets
==========================

**Due: Friday, October 26th at 4pm**

You may work alone or in a pair on this assignment.

The purpose of this assignment is to give you experience with using
dictionaries to represent data and as a mechanism for mapping keys to
values that change over time.  You will also get a chance to practice
using functions to avoid repeated code.

Introduction
============

On April 18th, 2017, Theresa May, the Prime Minister of the United
Kingdom, announced that she would call a snap election.  This news came
as a bit of a surprise, since the next election was not due to be held
until 2020.  Her stated reason for calling the election was a desire to
strengthen the UK Government's hand in negotiations with the European
Union (EU) over Britain's exit from the EU (colloquially referred to as
Brexit).  While the election did not necessarily play out to Prime
Minister May's satisfaction, it did yield a trove of tweets that we can
mine for insight into British politics.

Unlike US presidential elections, which seem to last forever, in the UK
the period between when a general election is officially called and the
date the election is held is quite short, typically six weeks or so.
During this pre-election period, known as `purdah `__, civil servants
are restricted from certain activities that might be viewed as biased
towards the party currently in power.  Purdah ran from April 22nd until
June 9th for the 2017 General Election.

For this assignment, we'll be analyzing tweets sent from the official
Twitter feeds of four parties: the Conservative and Unionist Party
(``@Conservatives``), the Labour Party (``@UKLabour``), the Liberal
Democrats (``@LibDems``), and the Scottish National Party (``@theSNP``)
during purdah.  We'll ask questions such as:

- What was ``@Conservatives``'s favorite hashtag during purdah?
  [``#bbcqt``]
- Who was mentioned at least 50 times by ``@UKLabour``?
  [``@jeremycorbyn``]
- What words occurred most often in ``@theSNP``'s tweets?
  [``snp, scotland, our, have, more``]
- What two-word phrases occurred most often in ``@LibDems``'s tweets?
  [``stand up, stop tories, will stand, theresa may, lib dems``]
- How do the parties' feeds change over time?  [It is a short election
  season, so not too much, especially for the Conservatives.]

For those of you who do not follow British politics, a few notes:

#. The hashtag ``#bbcqt`` refers to *BBC Question Time*, a political
   debate program on the British Broadcasting Corporation.
#. Jeremy Corbyn is the leader of the Labour Party.
#. Theresa May is the leader of the Conservatives.
#. Nicola Sturgeon is the leader of the Scottish National Party.
#. Tim Farron was the leader of the Liberal Democrats during the 2017
   election.
#. The Conservatives are also known as the Tories.

Getting Started
===============

Students working alone should use their personal repositories.  Pairs
should work exclusively in their pair repositories.  Please see the
instructions linked below for how to get a pair repository.

See `these start-up instructions `__ if you intend to work alone.

See `these start-up instructions `__ if you intend to work with the
same partner as in PA #2.

See `these start-up instructions `__ if you intend to work in a *NEW*
pair.

A description of the contents of the ``pa3`` directory follows.

There are two files you will be modifying:

- ``basic_algorithms.py`` -- you will add code for Part 1 to this file.
- ``analyze.py`` -- you will add code for the tasks in Part 2 to this
  file.

Two other files contain the test code we will use throughout this
assignment.  You will not modify these files:

- ``test_basic_algorithms.py`` -- this file contains very basic test
  code for the algorithms you will implement for Task 0.
- ``test_analyze.py`` -- test code for Part 2 of this writeup.

Three files contain functions that will be helpful in completing the
assignment.  You will not modify these files:

- ``clean.py`` -- this file contains a few useful functions for
  cleaning the data.
- ``load_tweets.py`` -- this file contains code to load the tweets for
  the four different parties to use during testing.
- ``util.py`` -- this file contains a few useful functions.

Finally, the last part of the distribution is the tweet data:

- ``data/`` -- a directory for the tweet files.  Some of the data files
  are large and are not included in the Git repository.  See the Data_
  section below for instructions on how to obtain these files.

Recall that you can set up ``ipython3`` to reload code automatically
when it is modified by running the following commands after you fire up
``ipython3``: ::

    In [1]: %load_ext autoreload

    In [2]: %autoreload 2

    In [3]: import basic_algorithms, analyze

Part 1: Basic algorithms
========================

Algorithms for efficiently counting and sorting distinct `items`, or
unique values, are widely used in data analysis.  For example, we might
want to find the most frequent keyword used in a document or the top 10
most popular links clicked on a website.  In Part 1, you will implement
three such algorithms: ``find_top_k``, ``find_min_count``, and
``find_frequent``.  The sorting function described below will be very
useful in their implementation.

Task 0
~~~~~~

Your first task is to add code to ``basic_algorithms.py`` to implement
the three algorithms described below.  We have provided code,
``test_basic_algorithms.py``, that runs a few test cases for each of
the algorithms.  As in the previous assignments, our test code uses the
``pytest`` testing framework.  You can use the ``-k`` flag and the name
of the algorithm (``top_k``, ``min_count``, and ``frequent``) to run
the test code for a specific algorithm.  For example, running the
following command from the **Linux command-line**::

    $ py.test -xv -k min_count test_basic_algorithms.py

will run the tests for the `min count` algorithm.  (As usual, we use
``$`` to signal the **Linux command-line prompt**.  You should not type
the ``$``.)

Counting distinct items
~~~~~~~~~~~~~~~~~~~~~~~~~~

The first step is to write a helper function that counts distinct
items, ``count_items``.  This function takes as input a list of items
(in our case strings, but they can be any comparable and hashable type)
and returns a new list with each distinct item and the number of times
it appears in the input list.

#. Use a dictionary to count the number of times each unique value
   occurs,
#. Extract a list of (key, count) pairs from the dictionary,
#. Return this list.

For example, if we have the list:

.. code-block:: python

    ['A', 'B', 'C', 'A']

the function should yield (the order is not important):

.. code-block:: python

    [('A', 2), ('B', 1), ('C', 1)]

Notes:

#. Do not use Python libraries other than what is already imported.
   (For example, you may *not* use ``collections.Counter``.)
#. If you use Python's ``list.count`` method, you are probably not
   following the instructions above to use a dictionary to perform the
   counting, and you will not receive credit.
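The dictionary-based counting pattern looks roughly like this (a
minimal sketch of the idea, not a required implementation):

.. code-block:: python

    counts = {}
    for item in ['A', 'B', 'C', 'A']:
        counts[item] = counts.get(item, 0) + 1

    # list(counts.items()) now yields [('A', 2), ('B', 1), ('C', 1)]
    # (the order may vary)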
Ordering tuples
~~~~~~~~~~~~~~~

We provide a function, named ``sort_count_pairs``, that will handle
sorting the pairs for you.  Given a Python list of (string, integer)
tuples:

.. code-block:: python

    [(k0, v0), (k1, v1), ...]

the function sorts the list in descending order by the integer values.
We use the keys (``k0``, ``k1``, etc.) to break ties and order them in
ascending lexicographic (alphabetical) order.  Note: this differs from
the default sort order for pairs in Python, which uses the first item
in the pair as the primary sort key and the second value as the
secondary sort key, and sorts both in ascending order.

For example, given the list:

.. code-block:: python

    [('D', 1), ('C', 2), ('A', 5), ('B', 2)]

our function ``sort_count_pairs`` should return:

.. code-block:: python

    [('A', 5), ('B', 2), ('C', 2), ('D', 1)]

Top K
~~~~~

The first algorithm, ``find_top_k``, computes the :math:`K` items that
occur most frequently in the list.  To do this computation, you will
need to:

#. Count the items,
#. Sort the resulting pairs using the supplied function
   ``sort_count_pairs``,
#. Return the first :math:`K` pairs.  These pairs must be sorted in
   descending order of the counts.

Here is an example use of this function: ::

    In [4]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

    In [5]: basic_algorithms.find_top_k(l, 2)
    Out[5]: [('A', 5), ('B', 2)]

You can test your code by running::

    $ py.test -xv -k top_k test_basic_algorithms.py

Minimum number of occurrences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The second algorithm, ``find_min_count``, finds the items in a list
that occur at least some specified minimum number of times.  To
implement this algorithm, you will need to:

#. Count the items,
#. Build a list of the items and associated counts that meet the
   threshold,
#. Return that list in descending order by count.

Here is an example use of this function: ::

    In [6]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

    In [7]: basic_algorithms.find_min_count(l, 2)
    Out[7]: [('A', 5), ('B', 2), ('C', 2)]

After writing this function you should be able to pass all tests in::

    $ py.test -xv -k min_count test_basic_algorithms.py

Frequent items
~~~~~~~~~~~~~~

The previous two algorithms require `space` (computer memory)
proportional to the number of unique items in the list.  If we have a
very large list with many different items, these algorithms can be
impractical.  The third algorithm, ``find_frequent``, finds the items
in a sequence with a frequency that `exceeds` a :math:`1/K` fraction of
:math:`N`, the total number of items.  ``find_frequent`` uses space
proportional to :math:`K`, which is likely to be much smaller than
:math:`N`.  (Note: the value :math:`K` used in this algorithm is
unrelated to the value :math:`K` used in the Top K algorithm.)

``find_frequent`` uses a data structure :math:`D` with up to
:math:`K-1` counters, each associated with a particular item, to
approximate the frequency of the corresponding items in the list.  For
a given list item :math:`I`, you will update the counters using the
following rules:

#. If item :math:`I` occurs in :math:`D`, then increment the count
   associated with :math:`I` by one.
#. If item :math:`I` does not occur in :math:`D` and there are fewer
   than :math:`K-1` items in :math:`D`, then add :math:`I` with a value
   of one to :math:`D`.
#. If item :math:`I` does not occur in :math:`D` and there are
   :math:`K-1` items in :math:`D`, then decrement all of the counters
   by one and remove any items with a count of zero from :math:`D`.
The output is a list of tuples with items and their associated
counters.  An example output of `find_frequent` follows: ::

    In [8]: l0 = ['A', 'A', 'B', 'B', 'A', 'B', 'C', 'B', 'A']

    In [9]: basic_algorithms.find_frequent(l0, 3)
    Out[9]: [('A', 3), ('B', 3)]

Notice that although ``find_frequent`` and ``find_top_k`` identify the
same items, the values of the counters are different.  The
``find_frequent`` algorithm computes an approximation, not the exact
count.

You are responsible for implementing this logic in the skeleton code we
provide.  The iteration process and the counter have been set up for
you.  You will write the logic that updates the counter for each new
item.  If your counter contains more than :math:`K-1` items, the code
will return an error (and you will know that you did something wrong).

Note: When you are implementing Rule 3 above, remember that you
*cannot* safely modify a dictionary (or list) as you iterate over it.

To help you understand what is happening in this algorithm, here's some
output from a call to a version of ``find_frequent`` that has been
augmented to show the state of :math:`D` after processing each item in
the list:

.. parsed-literal::

    Input:
      items: ['A', 'A', 'B', 'B', 'A', 'B', 'C', 'B', 'A']
      k: 3

    After processing...'A', the state of D is...{'A': 1}
    After processing...'A', the state of D is...{'A': 2}
    After processing...'B', the state of D is...{'A': 2, 'B': 1}
    After processing...'B', the state of D is...{'A': 2, 'B': 2}
    After processing...'A', the state of D is...{'A': 3, 'B': 2}
    After processing...'B', the state of D is...{'A': 3, 'B': 3}
    After processing...'C', the state of D is...{'A': 2, 'B': 2}
    After processing...'B', the state of D is...{'A': 2, 'B': 3}
    After processing...'A', the state of D is...{'A': 3, 'B': 3}

    Output: [('A', 3), ('B', 3)]

Let's return to our earlier example: ::

    In [10]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

    In [11]: basic_algorithms.find_frequent(l, 3)
    Out[11]: [('A', 3), ('D', 1)]

This result may seem a bit odd.  The algorithm guarantees that the
result will include any item whose frequency `exceeds` :math:`N/K`,
which is why ``'A'`` occurs in the result.  If there are fewer than
:math:`K-1` values with a frequency that exceeds :math:`N/K`, then the
result may include values that occur less frequently, which explains
the presence of ``'D'`` in the result.  Again, here is a step-by-step
depiction of the process:

.. parsed-literal::

    Input:
      items: ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']
      k: 3

    After processing... 'A', the state of D is...{'A': 1}
    After processing... 'B', the state of D is...{'B': 1, 'A': 1}
    After processing... 'C', the state of D is...{}
    After processing... 'A', the state of D is...{'A': 1}
    After processing... 'A', the state of D is...{'A': 2}
    After processing... 'B', the state of D is...{'B': 1, 'A': 2}
    After processing... 'A', the state of D is...{'B': 1, 'A': 3}
    After processing... 'C', the state of D is...{'A': 2}
    After processing... 'A', the state of D is...{'A': 3}
    After processing... 'D', the state of D is...{'A': 3, 'D': 1}

    Output: [('A', 3), ('D', 1)]

(Note: In your final submission, you should **not** augment your code
to produce this output.  We included it for explanatory purposes only.)

This algorithm and other similar algorithms can also be used with
streaming data, or data that arrives continuously.
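For reference, the three update rules can be expressed roughly as
follows.  This is an illustrative sketch only (the helper name
``update_counters`` is not part of the assignment), and your submission
must fit the skeleton provided in ``basic_algorithms.py``:

.. code-block:: python

    def update_counters(D, item, k):
        """Apply one step of the frequent algorithm to the counter dictionary D."""
        if item in D:                 # Rule 1: item already has a counter
            D[item] += 1
        elif len(D) < k - 1:          # Rule 2: there is room for a new counter
            D[item] = 1
        else:                         # Rule 3: decrement everything, drop zeros
            for key in list(D):       # list(D) copies the keys so D can be modified
                D[key] -= 1
                if D[key] == 0:
                    del D[key]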
The paper `Finding Frequent Items in Data Streams `__ by Cormode and
Hadjieleftheriou summarizes and explains the relationship between the
frequencies reported in the results and the true frequencies.

You can test your code by running the following command::

    $ py.test -xv -k frequent test_basic_algorithms.py

Part 2: Analyzing Tweets
========================

Now that you have implemented the basic algorithms, you can start
analyzing Twitter feeds.  From now on, all code you write should be in
``analyze.py``.  This file contains a main block to allow the program
to be run from the command-line.  Run the following command from the
**Linux command-line**::

    $ python3 analyze.py --help

to get a description of the arguments.

While we provide function headers for each of the required tasks, the
structure of the rest of the code is up to you.  While some tasks can
be done cleanly with one function, others cannot.  We expect you to
look for sub-tasks that are common to multiple tasks and to reuse
previously completed functions.  This process of function decomposition
and thoughtful reuse of functions is one of the keys to writing clean
code.  We have indicated in ``analyze.py`` where you should put your
auxiliary functions.  Remember to follow the course style guide, which
requires that you write proper function headers for each auxiliary
function.

We have provided code for testing this part in ``test_analyze.py``.
Our test suite contains tests for checking your code on the examples
shown below and on various corner cases::

    $ py.test -xv test_analyze.py

Data
~~~~

Files
-----

Twitter enables us to search for tweets with particular properties,
say, from a particular user, containing specific terms, and within a
given range of dates.  There are several Python libraries that simplify
the process of using this feature of Twitter.  We used the
`TwitterSearch `_ library to gather tweets from the Twitter feeds of
``@Conservatives``, ``@UKLabour``, ``@theSNP``, and ``@LibDems`` from
the purdah period, and we stored the resulting data in `JSON `_ files.

Because these files are large, we did not include them in the
distribution (that is, the files we gave you).  Instead, please run the
shell script ``get_large_files.sh`` to download these files.  Change to
your ``pa3/data`` directory and run this command from the Linux
command-line::

    $ ./get_large_files.sh

This script will download one file per party:

- ``Conservatives.json``
- ``UKLabour.json``
- ``LibDems.json``
- ``theSNP.json``

Your VM must be connected to the network to use this script.  Also, if
you wish to use both CSIL & the Linux servers and your VM, you will
need to run the ``get_large_files.sh`` script twice, once for CSIL &
the Linux servers and once for your VM.

**Do not add the data to your repository!**

Exploring the data
------------------

To simplify the process of exploring the data and testing, we have
provided example code in ``load_tweets.py`` to load the tweet files and
assign the results to variables, one per party.  This code is shown
below:

.. code-block:: python

    import util

    Conservatives = util.get_json_from_file("data/Conservatives.json")
    UKLabour = util.get_json_from_file("data/UKLabour.json")
    theSNP = util.get_json_from_file("data/theSNP.json")
    LibDems = util.get_json_from_file("data/LibDems.json")

``util.py`` defines some useful helper routines for loading the stored
tweets.  The code above loads the set of tweets into each of the
variables (more about the structure of the data later).
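Once the tweets are loaded, it is worth poking around in ``ipython3``
before you write any code.  For example (output omitted, since it
depends on the files you downloaded):

.. code-block:: python

    len(UKLabour)          # number of tweets in the @UKLabour feed
    sorted(UKLabour[0])    # top-level keys of a single tweet dictionary
    UKLabour[0]['text']    # the text of the first tweet in the list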
We have also defined variables for a few specific tweets used in the
examples, as described in the code below:

.. code-block:: python

    # sample tweet from the "Data" section
    tweet0 = UKLabour[651]

    # sample tweet from the "Pre-processing step" and "Representing
    # N-grams" sections.
    tweet1 = UKLabour[55]

Representing tweets
-------------------

We encourage you to play around with extracting information about
hashtags, user mentions, etc. for a few tweets before you start working
on the tasks for Part 2.  To do so, you will have to understand how
information in a tweet is structured.

Each of the variables, ``theSNP``, ``Conservatives``, ``UKLabour``, and
``LibDems``, defines a list of tweets.  All of the analysis code that
you will be writing will operate on such lists.

A single tweet is represented by a dictionary that contains a lot of
information: creation time, hashtags used, users mentioned, text of the
tweet, etc.  For example, here is a tweet sent in mid-May by
``@UKLabour`` (variable ``tweet0`` above):

.. code-block:: python

    'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'

and here is an abridged version of the corresponding tweet dictionary
that includes a few of the 20+ key/value pairs:

.. code-block:: python

    {'created_at': 'Thu May 18 19:44:01 +0000 2017',
     'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
                               {'indices': [132, 139], 'text': 'GE2017'}],
                  'symbols': [],
                  'urls': [{'display_url': 'gov.uk/register-to-vo…',
                            'expanded_url': 'http://gov.uk/register-to-vote',
                            'indices': [108, 131],
                            'url': 'https://t.co/2Lf9M2q3mP'}],
                  'user_mentions': [{'id': 1597669326,
                                     'id_str': '1597669326',
                                     'indices': [3, 15],
                                     'name': 'Patchwork Foundation',
                                     'screen_name': 'UKPatchwork'},
                                    {'id': 105573429,
                                     'id_str': '105573429',
                                     'indices': [18, 30],
                                     'name': 'Iain McNicol',
                                     'screen_name': 'IainMcNicol'},
                                    {'id': 14291684,
                                     'id_str': '14291684',
                                     'indices': [35, 44],
                                     'name': 'The Labour Party',
                                     'screen_name': 'UKLabour'}]},
     'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}

Collectively, hashtags, symbols, user mentions, and URLs are referred
to as `entities`.  These entities can be accessed through the
``"entities"`` key in the tweet dictionary.  The value associated with
the key ``"entities"`` is itself a dictionary.  It maps a key
representing a type of entity (``"hashtags"``, ``"symbols"``,
``"urls"``, ``"user_mentions"``) to a list of all of the entities of
that type present in the tweet.  The entities themselves are
represented as dictionaries whose contents depend on the entity type.

Finding commonly-occurring entities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What are ``@theSNP``'s favorite hashtags?  Which URLs are referenced at
least 5 times by ``@LibDems``?  Are there users who represent at least
5% of the mentions in ``@Conservatives`` tweets?  To answer these
questions, we must extract the desired entities (hashtags, user
mentions, etc.) from the parties' tweets and process them.

The ``entities`` dictionary maps a key to a list of entities associated
with that key (each of which is a dictionary):

.. code-block:: python

    'entities': {'key': [{'subkey1': value11, 'subkey2': value12},
                         {'subkey1': value21, 'subkey2': value22}]}

For example, the hashtags contained in a tweet are represented by:
.. code-block:: python

    'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
                              {'indices': [132, 139], 'text': 'GE2017'}]}

The next three tasks will use two of the same parameters.  To avoid
repetition later, we describe them here:

- ``tweets`` is a list of dictionaries representing tweets.
- ``entity_key`` is one of the following pairs (corresponding to a key
  and a subkey): ``("hashtags", "text")``, ``("urls", "url")``, or
  ``("user_mentions", "screen_name")``.  The first item in the pair
  tells us the type of entity and the second tells us the key to look
  up in that entity.

For the tasks that use tweets, we will ignore case differences and so,
for example, we will consider ``"#bbcqt"`` to be the same as
``"#BBCQT"``.  In this assignment, you should *convert everything to
lowercase*.  If you find yourself under-counting some entities, verify
that you are correctly ignoring case differences.
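For example, given ``tweet0`` (the sample tweet shown above) and the
entity key ``("hashtags", "text")``, the lowercase hashtag texts can be
pulled out as follows (one possible approach, not a required helper):

.. code-block:: python

    key, subkey = ("hashtags", "text")
    values = [entity[subkey].lower() for entity in tweet0["entities"][key]]
    # values is ['getinvolved', 'ge2017'] for the sample tweet shown earlier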
Task 1: Top K entities
----------------------

For Task 1, you will write a function that finds the top ``k`` most
common entities in a list of tweets using the algorithm you wrote in
``basic_algorithms.py``.  You must complete the following function:

.. code-block:: python

    def find_top_k_entities(tweets, entity_key, k):

The first two parameters are as described above and ``k`` is an
integer.  This function, which is in ``analyze.py``, should return a
sorted list of (entity, count) pairs for the ``k`` most common
entities, where `count` is the number of occurrences of `entity`.

Here's a sample call using the tweets from ``@theSNP`` with
``("hashtags", "text")`` as the entity key and ``k=3``:

.. code-block:: python

    In [13]: analyze.find_top_k_entities(theSNP, ("hashtags", "text"), 3)
    Out[13]: [('votesnp', 625), ('ge17', 428), ('snpbecause', 195)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task1 test_analyze.py

The API tests are simplified examples that test corner cases; the
remaining tests use the real data in the data folder.

Task 2: Minimum number of occurrences
-------------------------------------

For Task 2, we will find all entities that appear a minimum number of
times using the ``min_count`` algorithm you wrote before.  You must
complete the following function:

.. code-block:: python

    def find_min_count_entities(tweets, entity_key, min_count):

where the first two parameters are as described above and ``min_count``
is an integer that specifies the minimum number of times an entity must
occur to be included in the result.  This function should return a
sorted list of (entity, count) pairs.

Here's a sample call using the tweets from ``@LibDems`` with
``("user_mentions", "screen_name")`` as the entity key and
``min_count=100``:

.. code-block:: python

    In [14]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name"), 100)
    Out[14]:
    [('libdems', 568),
     ('timfarron', 547),
     ('liberalbritain', 215),
     ('libdempress', 115)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task2 test_analyze.py

Task 3: Frequent entities
-------------------------

For Task 3, you will use the `frequent` algorithm you previously
implemented to identify commonly-occurring entities.  You must
implement the function:

.. code-block:: python

    def find_frequent_entities(tweets, entity_key, k):

The first two parameters are as described above and ``k`` plays the
same role as ``k`` in your basic algorithm ``find_frequent``.  This
function should return a sorted list of (entity, count) pairs computed
using the `frequent` algorithm.

Here's a sample use of this function on the tweets from
``@Conservatives`` with ``("hashtags", "text")`` as the entity key and
``k=5``:

.. code-block:: python

    In [15]: analyze.find_frequent_entities(Conservatives, ("hashtags", "text"), 5)
    Out[15]: [('bbcqt', 94), ('ge2017', 64), ('chaos', 33), ('voteconservative', 1)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task3 test_analyze.py

Analyzing N-grams
~~~~~~~~~~~~~~~~~

What additional insights might we gain by analyzing words and word
sequences?  In this part, you will apply the three algorithms described
earlier to contiguous sequences of :math:`N` words, which are known as
`n-grams`.  Before you apply these algorithms to a party's tweets, you
will pre-process the tweets to reveal salient words and then extract
the n-grams.  Your solution should use helper functions to avoid
duplicated code.

Pre-processing step
-------------------

The pre-processing step converts the text of a tweet into a list of
strings.  It is important to follow the order of the steps below
precisely so that your solution passes our tests.  You can find all
named constants in the file ``clean.py`` and can refer to them by their
names in your code.  (A sketch that puts these steps together appears
at the end of this section.)

1. We will define a `word` to be any sequence of characters delimited
   by whitespace.  For example, "abc", "10", and "#2017election." are
   all words by this definition.  *Turn the text of a tweet into a list
   of lowercase words.*

2. Next, we must remove any punctuation.  We have defined a constant,
   ``PUNCTUATION``, that specifies which characters constitute
   punctuation for this assignment.  Note that apostrophes, as in the
   word "it's", should not be removed, because they occur in the middle
   of the word.  *Remove any leading and trailing punctuation.*

3. Then, we want to focus attention on the important words in the
   tweets.  *Stop words* are commonly-occurring words, such as "and",
   "of", and "to", that convey little insight into what makes one tweet
   different from another.  We have defined a large set of stop words,
   including symbols and emojis, in the constant ``STOP_WORDS``.
   *Eliminate all words that are in the set* ``STOP_WORDS``.

4. Finally, we want to remove URLs, hashtags, and mentions.  We have
   defined a constant ``STOP_PREFIXES`` (which includes ``"#"``,
   ``"@"``, and ``"http"``).  *Eliminate all words that begin with any
   of the strings in* ``STOP_PREFIXES``.

Here is an example: pre-processing the text (see ``tweet1`` defined in
``load_tweets.py``)

.. parsed-literal::

    Things you need to vote
    Polling card? ✘
    ID? ✘
    A dog? ✘
    Just yourself ✔
    You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9

would yield:

.. code-block:: python

    ('things', 'need', 'vote', 'polling', 'card', 'id', 'dog', 'just',
     'yourself', "you've", 'got', 'until', '10pm', 'now')

You will find the ``lower``, ``split``, ``strip``, and ``startswith``
methods from the `string API `_ useful for this step.  You will save
yourself a lot of time and unnecessary work if you read about these
methods in detail *before* you start writing code.

Note: this processing will not be perfect.  There are constantly new
emojis and some will inevitably slip through the cracks.  You are only
responsible for filtering those defined in ``clean.py``.  An example of
such an issue is the following tweet: ::

    >>> theSNP[497]['text']
    'RT @LadyM_McManus: 👏👏👏 https://t.co/HbeQfmK6Pa'
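Putting the four steps together, the pre-processing might look roughly
like the sketch below.  The helper name ``preprocess`` is illustrative,
and the sketch assumes that ``PUNCTUATION`` is a string of characters
and that ``STOP_PREFIXES`` can be converted to a tuple for
``startswith``; check ``clean.py`` for the actual definitions.
Depending on how the constants are defined, you may also need to decide
how to handle words that become empty after Step 2.

.. code-block:: python

    import clean

    def preprocess(text):
        """Convert the text of a tweet into a list of salient lowercase words."""
        words = text.lower().split()                                  # Step 1
        words = [w.strip(clean.PUNCTUATION) for w in words]           # Step 2
        words = [w for w in words if w not in clean.STOP_WORDS]       # Step 3
        return [w for w in words                                      # Step 4
                if not w.startswith(tuple(clean.STOP_PREFIXES))]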
Representing N-grams
--------------------

Your code should first compute the n-grams of a tweet after
pre-processing the tweet's text.  These n-grams should be represented
as tuples of strings.  Given a value of 2 for :math:`N`, the above
``@UKLabour`` tweet would yield the following bi-grams (2-grams):

.. code-block:: python

    [('things', 'need'),
     ('need', 'vote'),
     ('vote', 'polling'),
     ('polling', 'card'),
     ('card', 'id'),
     ('id', 'dog'),
     ('dog', 'just'),
     ('just', 'yourself'),
     ('yourself', "you've"),
     ("you've", 'got'),
     ('got', 'until'),
     ('until', '10pm'),
     ('10pm', 'now')]

Notice that the n-grams do not "circle back" to the beginning.  That
is, the last word of the tweet and the first word of the tweet do not
comprise an n-gram (notice that ``('now', 'things')`` is not included).
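One compact way to build the n-grams from a pre-processed list of words
is to slide a window of length ``n`` over it.  This is an illustrative
sketch (the helper name ``make_ngrams`` is not required; any equivalent
loop is fine):

.. code-block:: python

    def make_ngrams(words, n):
        """Return the list of n-grams (as tuples) for a list of words."""
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    make_ngrams(['things', 'need', 'vote'], 2)
    # [('things', 'need'), ('need', 'vote')]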
Common parameters
-----------------

As in Tasks 1-3, Tasks 4-6 have two parameters in common:

- ``tweets`` is a list of dictionaries representing tweets (for
  example, ``theSNP``).
- ``n`` is the number of words in an n-gram.

Task 4: Top K n-grams
---------------------

Your task is to implement the following function, which applies your
previously written ``find_top_k`` function to find the most commonly
occurring n-grams:

.. code-block:: python

    def find_top_k_ngrams(tweets, n, k):

where the first two parameters are as described above and ``k`` is the
desired number of n-grams.  This function should return a sorted list
of the (n-gram, count) pairs for the ``k`` most common n-grams.

Here's a sample use of this function using the tweets from ``@theSNP``
with ``n=2`` and ``k=3``:

.. code-block:: python

    In [16]: analyze.find_top_k_ngrams(theSNP, 2, 3)
    Out[16]: [(('nicola', 'sturgeon'), 81), (('read', 'more'), 67), (('stand', 'up'), 55)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task4 test_analyze.py

Task 5: Minimum number of n-gram occurrences
--------------------------------------------

Similarly, we will apply ``find_min_count`` to find the n-grams that
appear at least a minimum number of times.  Your task is to implement
the function:

.. code-block:: python

    def find_min_count_ngrams(tweets, n, min_count):

where the first two parameters are as described above and ``min_count``
specifies the minimum number of times an n-gram must occur to be
included in the result.  This function should return a sorted list of
(n-gram, count) pairs.

Here's a sample use of this function using the tweets from ``@LibDems``
with ``n=2`` and ``min_count=100``:

.. code-block:: python

    In [17]: analyze.find_min_count_ngrams(LibDems, 2, 100)
    Out[17]:
    [(('stand', 'up'), 189),
     (('stop', 'tories'), 189),
     (('will', 'stand'), 166),
     (('theresa', 'may'), 125),
     (('lib', 'dems'), 116),
     (('can', 'stop'), 104),
     (('only', 'can'), 100)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task5 test_analyze.py

Task 6: Frequent n-grams
------------------------

Finally, we will apply the ``find_frequent`` function to find commonly
occurring n-grams using the `frequent` algorithm.  Your task is to
implement the function:

.. code-block:: python

    def find_frequent_ngrams(tweets, n, k):

where the first two parameters are as described above and ``k`` is an
integer.  This function should return a sorted list of (n-gram, counter
value) pairs as computed using the `frequent` algorithm.

For example, using this function on the tweets from ``@Conservatives``
with ``n=2`` and ``k=4`` would yield:

.. code-block:: python

    In [18]: analyze.find_frequent_ngrams(Conservatives, 2, 4)
    Out[18]: [(('canvassing', 'session'), 1), (('first', 'canvassing'), 1), (('session', 'mk'), 1)]

After completing this task, you should pass all tests in::

    $ py.test -xv -k task6 test_analyze.py

EDIT: We noticed some inconsistencies between versions of the emoji
library, and these inconsistencies can lead to different results with
the `frequent` algorithm when ``k`` is small.  We have decided to
remove ``test_task6_0``.  The writeup is consistent with the latest
version of the library available on the CSIL machines and the course
VM.

By-month analysis
~~~~~~~~~~~~~~~~~

Task 7
------

Your last task is to implement the function:

.. code-block:: python

    def find_top_k_ngrams_per_month(tweets, n, k):

which takes the same parameters as ``find_top_k_ngrams``.  It returns a
list of tuples that matches the year and month in which tweets were
sent with the top ``k`` ``n``-grams in those tweets.  This list of
tuples will have the form: ((year, month), sorted list of the top ``k``
``n``-grams and their counts).  The function ``util.grab_year_month``
will be useful for this task.  You should use the built-in Python
``sort`` function (not ``sort_count_pairs``) to sort the resulting list
of tuples.

Here's a sample use of this function using the tweets from
``@UKLabour`` with two as the value for ``n`` and five as the value for
``k``: ::

    In [19]: analyze.find_top_k_ngrams_per_month(UKLabour, 2, 5)
    Out[19]:
    [((2017, 4),
      [(('nhs', 'staff'), 15),
       (('labour', 'will'), 13),
       (('labour', 'government'), 11),
       (('register', 'vote'), 11),
       (('2', 'mins'), 9)]),
     ((2017, 5),
      [(('labour', 'will'), 51),
       (('register', 'vote'), 39),
       (('our', 'manifesto'), 26),
       (('tories', 'have'), 25),
       (('not', 'few'), 24)]),
     ((2017, 6),
      [(('labour', 'will'), 39),
       (('labour', 'gain'), 36),
       (('8', 'june'), 22),
       (('voting', 'labour'), 20),
       (('vote', 'labour'), 17)])]

Because the results can be hard to decipher in this form, we have
provided a function, ``util.pretty_print_by_month``, that takes the
output of ``find_top_k_ngrams_per_month`` and prints it in an
easier-to-read form.  For example, here is a sample use of this
function: ::

    In [20]: util.pretty_print_by_month(analyze.find_top_k_ngrams_per_month(UKLabour, 2, 5))
    April 2017:
        nhs staff            15
        labour will          13
        labour government    11
        register vote        11
        2 mins                9
    May 2017:
        labour will          51
        register vote        39
        our manifesto        26
        tories have          25
        not few              24
    June 2017:
        labour will          39
        labour gain          36
        8 june               22
        voting labour        20
        vote labour          17

After completing this task, you should pass all tests in::

    $ py.test -xv -k task7 test_analyze.py

Grading
=======

Programming assignments will be graded according to a general rubric.
Specifically, we will assign points for completeness, correctness,
design, and style.  (For more details on the categories, see our
`PA Rubric page <../rubric.html>`__.)

The exact weights for each category will vary from one assignment to
another.  For this assignment, the weights will be:

* **Completeness:** 50%
* **Correctness:** 20%
* **Design:** 15%
* **Style:** 15%

While we are telling you what functions to implement in this
assignment, some of the tasks will benefit from using helper functions.
Your design score will depend largely on whether you make adequate use
of helper functions, as well as whether your code is well structured
and not too convoluted.

As usual, we may deduct points if we spot errors in your code that are
not explicitly captured by the tests.  In this assignment, we will also
be paying special attention to the efficiency of your solutions.
For example, if you write a solution that uses a doubly-nested loop
when a single loop would have been enough (or one that iterates over a
data structure multiple times when a single pass would have been
enough), we will deduct correctness points.

Finally, remember that you *must* include header comments in any
functions you write (and you must make sure they conform to the format
specified in the style guide).  Do not remove the header comments on
any of the functions we provide.

Obtaining your test score
~~~~~~~~~~~~~~~~~~~~~~~~~

As in previous assignments, you can obtain your test score by running
``py.test`` followed by ``../common/grader.py``.

Continuous Integration
======================

Continuous Integration (CI) is available for this assignment.  For more
details, please see our `Continuous Integration <../../ci/index.html>`__
page.  We strongly encourage you to use CI in this and subsequent
assignments.

Submission
==========

See `these submission instructions `__ if you are working alone.

See `these submission instructions `__ if you are working in the same
pair as for PA #2.

See `these submission instructions `__ if you are working in a *NEW*
pair.