Eleni Kalamara, Arthur Turrell, Chris Redl, George Kapetanios, and Sujit Kapadia, "Making Text Count: Economic Forecasting Using Newspaper Text", *Journal of Applied Econometrics*, Vol. 37, No. 5, 2022, pp. 896-919.

Everything (code and data) is in a single zip file called `ktrkk-files.zip`. The rest of this readme spells out what is in each directory and how to run the code to generate the results.

## Data

Some data are in csv files. Bigger datasets, e.g. the results from running models, are in `.pkl` ('pickle') format. You can find more information on pickle files and how to read them [here](https://docs.python.org/3/library/pickle.html#data-stream-format). We apologise for not making these data available in a cross-language format such as parquet; the choice to use pickle goes back to the start of the project.

### Benchmark Data

This contains csv files with series such as the Bank of England measure of uncertainty, the RICS House Price Balance, statistical series like UK GDP, and much more. Unfortunately, as many of the series were created internally from confidential data at the Bank of England or are proprietary, we are not able to share them.

### Raw Data

The original data, as downloaded from Dow Jones (and stored in `DataSources/DowJones`) or assembled into single-newspaper csv files (in `data/raw`), are commercial and therefore cannot be included in this replication packet. However, the original query of the Dow Jones API is included along with all subsequent processing code. To obtain these data, you will need to use the Dow Jones API and run the same query as in `DataSources/DowJones/770cf8c9-5da1-4a02-a57a-2d101b114b57_query.json`.

### Intermediate Data

These are computed text metrics and term-frequency matrices, found in `data/intermed/`. They are not included in the replication packet because they contain article-level data (such as counts of words by article).
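If you have not worked with pickle before, the `.pkl` files mentioned above are most easily read with pandas (only unpickle files from sources you trust). A minimal self-contained sketch; the filename here is a throwaway created for the demo, not one of the actual files in `data/results/`:

```python
import pandas as pd

# Create a tiny frame and round-trip it through pickle, exactly as you
# would read one of the .pkl result files (with its real filename).
demo = pd.DataFrame(
    {"metric": ["sentiment", "uncertainty"], "value": [0.12, 0.34]}
)
demo.to_pickle("demo.pkl")

loaded = pd.read_pickle("demo.pkl")
print(loaded.shape)  # (2, 2)
```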
### Results Data

These are found in `data/results/` and are used to create all of the charts and tables for the paper. They are included in the replication packet.

## Guide to repository

A brief guide to what's in each folder used to produce the paper:

- `download_raw_from_api` contains the code to interact with the Dow Jones APIs, most importantly to pull down data in the form of split-up `.avro` files from the snapshot API. Other code in the folder processes these data and turns them into a csv containing a single newspaper's text per file.
- `DataSources/DowJones` contains the raw data that is downloaded from the API before it gets turned into a single-newspaper file and stored in `data/raw`. It also contains the queries passed to the API, some response information, and some analysis of that response information. Note that these data cannot be shared in the replication packet; however, the original query is included.
- `data/raw` contains one csv file per newspaper (the contents of this folder are created by the code in `download_raw_from_api/`). (Data not included in replication packet.)
- `data/intermed` contains intermediate data, such as the term-frequency matrices. (Data not included in replication packet.)
- `data/results` contains the results of all of the forecasts.
- `Dictionaries` contains the various dictionaries used to compute the algorithmic text metrics.
- `benchmark` contains the non-text data used in forecasts. (Data not included in replication packet.)
- `tests` contains code to run tests on some key components of the code.
- `output-scratch` is a folder where temporary work may be stored; it is empty in the replication packet.
- `output` houses outputs that are used in the paper. (Folder contents not included in replication packet.)
- `creds` is a folder for credentials for various code-based services (e.g. cloud) and is empty in the replication packet.
- `envconfig` contains a yaml file for doing 'batch' cloud computing runs on Microsoft's Azure cloud service.
- `latex` contains the LaTeX code for the paper. (Folder contents not included in replication packet; similarly for the `latex_slides` folder.)

## The Code

### Operating System

A Dockerfile is included with the project. Although you can run the code locally using the environment detailed in the next section, the Dockerfile lets you ensure the code is executed in the same OS. You will need to build the Dockerfile image. Once it is built, use it interactively with

```bash
docker run -v /full/path/to/MakingTextCount:/home/ --rm -it image_name
```

on the command line to mount a folder called `MakingTextCount` as a volume that is kept in sync with the local machine directory.

### Python Environment

The reproducible code environment may be found in `mtcenv.yml`. To create the environment, run

```bash
conda env create -f mtcenv.yml
```

on the command line. This creates an environment called `mtcenv` that is used to run all of the code. To use this environment in, for example, VS Code, use the "Python: Select Interpreter" command from the Command Palette (⇧⌘P on Mac) or change it in the lower left-hand side of the window, on the blue status bar. On the command line, the environment can be activated with `conda activate mtcenv`.

### Running Code from the Project

**`main.py` contains a series of commands that runs all of the analysis on a single machine.** Code should be executed from the root directory using the `mtcenv` environment. Imports are always done relative to this root, e.g. `import src.foobar as foo`. However, the machine learning forecasts are extremely computationally intensive (and there are a lot of them), so it is more efficient to run the forecasts split over many cloud computing instances.
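For reference, the single-machine route boils down to two commands from the repository root (assuming the `mtcenv` environment has already been created as described above):

```shell
# Activate the reproducible environment, then let main.py drive the
# full single-machine pipeline from the repository root.
conda activate mtcenv
python main.py
```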
We used Azure Batch jobs to do this.

#### Running Many Forecasts On Cloud: Workflow for Azure Batch Shipyard Jobs

- git clone the code onto a Virtual Machine (VM)
- transfer the code onto a blob storage account from the VM using the Azure CLI tools
- upload any data files onto blob storage from the VM or a laptop using the Azure CLI tools
- submit the dockerfile with the correct environment
- create the worker pool
- submit the workers

#### Code structure

To reproduce the results, put the csvs (one per newspaper) in `raw/` and make sure the same names appear in the config file under `papers`. Executing the imports and commands in `main.py` then follows the structure below to produce the outputs. Not shown below is the `output-scratch` folder, which is used for outputs that will not be included in the paper.

![Organisation chart of the code](MTC_org_chart.png)

#### Running tests

Run the below on the command line (with `mtcenv` activated):

```bash
pytest src/tests.py
```

### Project settings

These may be found in `config.ini` and set everything from which newspapers are included in the analysis to the data visualisation options.

### LaTeX

Because the .eps files used for figures are not in a sub-directory of the main .tex files, you must add a flag to the LaTeX compiler. In TeXShop, the steps are:

- Go to Preferences
- Go to the "Engine" tab
- Go to the field "pdfTeX"
- In the LaTeX Input Field, add `--shell-escape` at the end so that it changes from `pdflatex --file-line-error --synctex=1` to `pdflatex --file-line-error --synctex=1 --shell-escape`

## Dictionaries

- The Loughran and McDonald dictionary is from https://www3.nd.edu/~mcdonald/Word_Lists.html. It is used in Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. *The Journal of Finance*, 66(1), 35-65.
- The anxiety/excitement dictionary is from Nyman, R., Gregory, D., Kapadia, S., Ormerod, P., Tuckett, D., & Smith, R. (2015). News and narratives in financial systems: exploiting big data for systemic risk assessment.
*Bank of England*, mimeo.
- The Harvard General Inquirer sentiment dictionary is from http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm.
- The AFINN dictionary; a reference is Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. *Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages*, CEUR Workshop Proceedings 718, 93-98. http://arxiv.org/abs/1103.2903
- The Fed dictionary of financial stability is from https://www.federalreserve.gov/econres/notes/ifdp-notes/constructing-a-dictionary-for-financial-stability-20170623.htm.
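All of these dictionaries are applied in the same basic way: count matches of dictionary terms in each article and combine the counts into a per-article score. As a toy illustration of the mechanics only (the word lists and the scoring rule below are made up for the example; they are not any of the dictionaries above or the paper's actual metrics):

```python
# Toy positive/negative word lists -- placeholders, not any of the
# dictionaries listed above.
POSITIVE = {"growth", "gain", "optimism"}
NEGATIVE = {"crisis", "loss", "uncertainty"}

def net_sentiment(text: str) -> float:
    """Net sentiment: (positive hits - negative hits) / total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)

print(net_sentiment("uncertainty and loss dominate the outlook"))  # -0.3333...
```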