Eleni Kalamara, Arthur Turrell, Chris Redl, George Kapetanios, and Sujit Kapadia, "Making Text Count: Economic Forecasting Using Newspaper Text", *Journal of Applied Econometrics*, Vol. 37, No. 5, 2022, pp. 896-919.

Everything (code and data) is in a single zip file called `ktrkk-files.zip`. The rest of this readme spells out what is in each directory and how to run the code to generate the results.

## Data

Some data are in csv files. Bigger datasets, e.g. the results from running models, are in `.pkl` ('pickle') format. You can find more information on pickle files and how to read them [here](https://docs.python.org/3/library/pickle.html#data-stream-format). We apologise for not making these data available in a cross-language format such as parquet; the choice to use pickle goes back to the start of the project.

### Benchmark Data

This contains csv files with series such as the Bank of England measure of uncertainty, the RICS House Price Balance, statistical series like UK GDP, and much more. Unfortunately, as many of the series were created internally from confidential data at the Bank of England or are proprietary, we are not able to share them.

### Raw Data

The original data, as downloaded from Dow Jones (and stored in `DataSources/DowJones`) or assembled into single-newspaper csv files (in `data/raw`), are commercial and therefore cannot be included in this replication packet. However, the original query of the Dow Jones API is included along with all subsequent processing code. To obtain these data, you will need to use the Dow Jones API and run the same query as in `DataSources/DowJones/770cf8c9-5da1-4a02-a57a-2d101b114b57_query.json`.

### Intermediate Data

These are computed text metrics and term-frequency matrices, found in `data/intermed/`. They are not included in the replication packet because they contain article-level data (such as counts of words by article).
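If you have not worked with pickle before, the `.pkl` files mentioned above are most easily read with pandas (only unpickle files from sources you trust). A minimal self-contained sketch; the filename here is a throwaway created for the demo, not one of the actual files in `data/results/`:

```python
import pandas as pd

# Create a tiny frame and round-trip it through pickle, exactly as you
# would read one of the .pkl result files (with its real filename).
demo = pd.DataFrame(
    {"metric": ["sentiment", "uncertainty"], "value": [0.12, 0.34]}
)
demo.to_pickle("demo.pkl")

loaded = pd.read_pickle("demo.pkl")
print(loaded.shape)  # (2, 2)
```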
### Results Data

These are found in `data/results/` and are used to create all of the charts and tables for the paper. They are included in the replication packet.

## Guide to repository

A brief guide to what's in each folder used to produce the paper:

- `download_raw_from_api` contains the code to interact with the Dow Jones APIs, most importantly to pull down data in the form of split-up `.avro` files from the snapshot API. Other code in the folder processes these data and turns them into a csv containing a single newspaper's text per file.
- `DataSources/DowJones` contains the raw data that is downloaded from the API before it gets turned into a single-newspaper file and stored in `data/raw`. It also contains the queries passed to the API, some response information, and some analysis of that response information. Note that these data cannot be shared in the replication packet; however, the original query is included.
- `data/raw` contains one csv file per newspaper (the contents of this folder are created by the code in `download_raw_from_api/`). (Data not included in replication packet.)
- `data/intermed` contains intermediate data, such as the term-frequency matrices. (Data not included in replication packet.)
- `data/results` contains the results of all of the forecasts.
- `Dictionaries` contains the various dictionaries used to compute the algorithmic text metrics.
- `benchmark` contains the non-text data used in forecasts. (Data not included in replication packet.)
- `tests` contains code to run tests on some key components of the code.
- `output-scratch` is a folder where temporary work may be stored; it is empty in the replication packet.
- `output` houses outputs that are used in the paper. (Folder contents not included in replication packet.)
- `creds` is a folder for credentials for various code-based services (e.g. cloud) and is empty in the replication packet.
- `envconfig` contains a yaml file for doing 'batch' cloud computing runs on Microsoft's Azure cloud service.
- `latex` contains the LaTeX code for the paper. (Folder contents not included in replication packet; similarly for the `latex_slides` folder.)

## The Code

### Operating System

A Dockerfile is included with the project. Although you can run the code locally using the environment detailed in the next section, the Dockerfile lets you ensure the code is executed in the same OS. You will need to build the Dockerfile image. Once it is built, use it interactively with

```bash
docker run -v /full/path/to/MakingTextCount:/home/ --rm -it image_name
```

on the command line to mount a folder called `MakingTextCount` as a volume that is kept in sync with the local machine directory.

### Python Environment

The reproducible code environment may be found in `mtcenv.yml`. To create the environment, run

```bash
conda env create -f mtcenv.yml
```

on the command line. This creates an environment called `mtcenv` that is used to run all of the code. To use this environment in, for example, VS Code, use the "Python: Select Interpreter" command from the Command Palette (⇧⌘P on Mac) or change it in the lower left-hand side of the window, on the blue status bar. On the command line, the environment can be activated with `conda activate mtcenv`.

### Running Code from the Project

**`main.py` contains a series of commands that runs all of the analysis on a single machine.** Code should be executed from the root directory using the `mtcenv` environment. Imports are always done relative to this root, e.g. `import src.foobar as foo`. However, the machine learning forecasts are extremely computationally intensive (and there are a lot of them), so it is more efficient to run the forecasts split over many cloud computing instances.
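For reference, the single-machine route boils down to two commands from the repository root (assuming the `mtcenv` environment has already been created as described above):

```shell
# Activate the reproducible environment, then let main.py drive the
# full single-machine pipeline from the repository root.
conda activate mtcenv
python main.py
```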
We used Azure Batch jobs to do this.

#### Running Many Forecasts On Cloud: Workflow for Azure Batch Shipyard Jobs

- git clone the code onto a Virtual Machine (VM)
- transfer the code onto a blob storage account from the VM using the Azure CLI tools
- upload any data files onto blob storage from the VM or a laptop using the Azure CLI tools
- submit the dockerfile with the correct environment
- create the worker pool
- submit the workers

#### Code structure

To reproduce the results, put the csvs (one per newspaper) in `raw/` and make sure the same names appear in the config file under `papers`. Executing the imports and commands in `main.py` then follows the structure below to produce the outputs. Not shown below is the `output-scratch` folder, which is used for outputs that will not be included in the paper.

![Organisation chart of the code](MTC_org_chart.png)

#### Running tests

Run the below on the command line (with `mtcenv` activated):

```bash
pytest src/tests.py
```

### Project settings

These may be found in `config.ini` and set everything from which newspapers are included in the analysis to the data visualisation options.

### LaTeX

Because the .eps files used for figures are not in a sub-directory of the main .tex files, you must add a flag to the LaTeX compiler. In TeXShop, the steps are:

- Go to Preferences
- Go to the "Engine" tab
- Go to the field "pdfTeX"
- In the LaTeX Input Field, add `--shell-escape` at the end so that it changes from `pdflatex --file-line-error --synctex=1` to `pdflatex --file-line-error --synctex=1 --shell-escape`

## Dictionaries

- The Loughran and McDonald dictionary is from https://www3.nd.edu/~mcdonald/Word_Lists.html. It is used in Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. *The Journal of Finance*, 66(1), 35-65.
- The anxiety/excitement dictionary is from Nyman, R., Gregory, D., Kapadia, S., Ormerod, P., Tuckett, D., & Smith, R. (2015). News and narratives in financial systems: exploiting big data for systemic risk assessment.
*Bank of England*, mimeo.
- The Harvard General Inquirer sentiment dictionary is from http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm.
- The AFINN dictionary; a reference is Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. *Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages*, CEUR Workshop Proceedings 718, 93-98. http://arxiv.org/abs/1103.2903
- The Fed dictionary of financial stability is from https://www.federalreserve.gov/econres/notes/ifdp-notes/constructing-a-dictionary-for-financial-stability-20170623.htm.
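All of these dictionaries are applied in the same basic way: count matches of dictionary terms in each article and combine the counts into a per-article score. As a toy illustration of the mechanics only (the word lists and the scoring rule below are made up for the example; they are not any of the dictionaries above or the paper's actual metrics):

```python
# Toy positive/negative word lists -- placeholders, not any of the
# dictionaries listed above.
POSITIVE = {"growth", "gain", "optimism"}
NEGATIVE = {"crisis", "loss", "uncertainty"}

def net_sentiment(text: str) -> float:
    """Net sentiment: (positive hits - negative hits) / total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)

print(net_sentiment("uncertainty and loss dominate the outlook"))  # -0.3333...
```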