Bocconi Students Options Club

A Quant’s Journey, Episode 2: Web Scraping for Finance


Can financial and accounting data predict stock performance?


From an economist’s perspective, the answer to this episode’s question (and the following ones’) is a clear no. Accounting data is public information, and markets are efficient in the sense that all the expectations derived from it are already priced in. The market price of an asset reflects its true value.


On the other hand, you or your typical investor would not buy a stock whose so-called fundamentals you do not believe in. But is this research giving you a competitive advantage, or is it merely a waste of time because the “efficient market hypothesis” holds?


Let’s investigate this question over the course of the next episodes by applying machine learning techniques to forecast stock performance.


The schedule will look as follows:

· Episode 2 (today): Write some scripts to mine the necessary data.

· Episode 3: Run OLS and probit regressions.

To compute most of the KPIs I want to use for forecasting, I will need the companies’ balance sheets and statements of profit and loss. After doing some research, I found a quite convenient function which returns said data for a given stock ticker. It works in a comparable fashion to the yfinance library I used in the last episode for the historical prices.


This function returns the past three annual financial statements for the respective firm, since for older data a Yahoo premium account would be required. I might consider a subscription once the algorithm actually generates some returns… Anyway, for later use I store the data in a .csv file.
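For those who want to follow along, here is a minimal sketch of what such a query can look like. I am assuming the yahoofinancials package and its get_financial_stmts() function here; I also dump the raw result as JSON in this sketch, since the nested dictionary maps more naturally to that format than to .csv.

# Sketch: fetch the annual balance sheets and P&L statements for one ticker
# (assumes the yahoofinancials package; adapt the names to your own setup)
import json
from yahoofinancials import YahooFinancials

ticker = "AAPL"
stmts = YahooFinancials(ticker).get_financial_stmts("annual", ["balance", "income"])

# store the raw result for later use
with open(ticker + "_statements.json", "w") as file:
    json.dump(stmts, file)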

The data we get is – unfortunately – a bit messy and not very easy to work with without some cleaning. Let me give you a taste of how it is structured:

The initial idea was to have a nested dictionary. These dictionaries work like the folder-with-subfolder system you use for your own data in, say, the Documents folder on your PC. Your first “key” (e.g., AAPL) opens a “value” (e.g., 2019-12-31), which in turn serves as a key for another item (e.g., intangible assets), which again serves as a key yielding (finally) the associated value, say $10 Bn.
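In Python terms, the intended nesting would look something like this (the figure is made up for illustration):

# Hypothetical sketch of the intended nesting (made-up number)
statements = {
    "AAPL": {                                    # key 1: the ticker
        "2019-12-31": {                          # key 2: the statement date
            "intangibleAssets": 10_000_000_000   # key 3: the line item
        }
    }
}

statements["AAPL"]["2019-12-31"]["intangibleAssets"]  # "unpeels" down to $10 Bn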

So, if you know how to “unpeel the onion”, you get access to the specific item of your choice, which you can then read out and use for computations. Unfortunately, there are two major problems:


  1. The original data structure. Dictionaries are indicated by curly brackets {“key”: “value”}. If you look closely at the mess above (Figure 1), you can see that the first “value” is in fact a list, indicated by square brackets []. What we have here is a list of dictionaries inside a dictionary, which makes the unpeeling (unnecessarily) complicated. In the end, my algorithm iterates over every item in that list (one for every year, so three in total) to check whether it equals the desired date before it can descend another layer.

  2. Speaking of dates. Eventually, the algorithm is supposed to make investment decisions based on all the information available up to the “present”. If it is to make investment decisions for 2018, it must not know more than the data available up to 31st December 2017. Taking 31st Dec as the cut-off seems very convenient, since firms usually publish their annual financial reports with respect to this date. Usually.

Of course, there are firms whose auditors take a bit longer or are faster, while some companies even release their financial statements in summer. So I decided the following: the deadline would be in February, and only firms that published their numbers in the three months prior would be considered. This way, all the statements from January would be included, and the ones from the end of December would still be reasonably current. Together, they make up the largest part of the S&P 500.

The little script I wrote to store the correct dates for each firm for the years 2016–2019 got named get_dates(). Of course, it would be a cheap shot at this point to make a joke about programmers struggling to get actual romantic dates. Also, I can’t think of a good one right now.

The next step of get_dates() is to open the .csv file, enter the second layer with the list of available dates, turn each of the three str objects (in other words: plain text) into a datetime object (see above), and test whether the date lies within the desired range.

If we get a hit, we convert the correct date back to a string (text) and store it for later usage.
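Put together, a minimal sketch of that logic could look as follows. The real get_dates() works on my .csv file; here I assume the list-of-dictionaries structure described under problem 1, and I translate the three-month rule into a window from 1st November to 31st January.

from datetime import datetime

def get_dates(statements, year):
    # statements: ticker -> list of {date string: figures} dictionaries,
    # one entry per year (the structure described under problem 1)
    start = datetime(year - 1, 11, 1)   # three months before the February deadline
    end = datetime(year, 1, 31)
    dates = {}
    for ticker, yearly_entries in statements.items():
        for entry in yearly_entries:
            for date_str in entry:      # each entry holds a single date key
                date = datetime.strptime(date_str, "%Y-%m-%d")
                if start <= date <= end:
                    # hit: convert back to a string and store it for later
                    dates[ticker] = date.strftime("%Y-%m-%d")
    return dates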


Computing Key Performance Indicators


Now that we can run automated queries on all the available data from the financial statements, we come to a less tedious and more enjoyable part of this episode. For my first machine learning attempts, I want to feed the machine the following explanatory variables:

- Return on Assets and/or Equity (RoA/RoE)

- Current Ratio and/or Debt-to-Equity Ratio

- Sales Growth (Sales2017/Sales2016)

- Price-to-Book Ratio (P/B)

- Earnings per Share (EPS)

- Price-Earnings Ratio (P/E)


But if you really paid attention during your accounting classes (I certainly did not), you will see a problem with computing the last two ratios. If not, here is the answer: you need the number of shares outstanding along with the stock price. Both are figures you will find neither on a balance sheet nor on a statement of profit and loss or a cash flow statement. While we already got the historical prices in the last episode using yfinance, this leaves us with the question of how to get the number of shares a firm has issued in the past.

Of course, there are websites like statista.com or macrotrends.net which provide such information. But rather than copy-pasting everything by hand, I would rather have the computer, or an intern, do it for me.



While I am currently, unfortunately, lacking the latter, the PC can do that work for me. It should scan the source code of the website (Figure 2), search for a table called “Shares Outstanding” and mysteriously find the correct row/column within it. Also, I do not want to have to feed it the correct URL for every stock by hand, so the website must have some sort of coherent structure which can be exploited for the web scraping. What do I mean by that?

Let’s have a look at these two URLs from macrotrends.net and sharesoutstandinghistory.com:

- https://www.macrotrends.net/stocks/charts/AAPL/apple/shares-outstanding

- https://www.sharesoutstandinghistory.com/?symbol=aapl


While the website sharesoutstandinghistory.com only requires typing in the correct symbol (aapl), Macrotrends also needs the company’s name (apple). This is a problem, because the way Macrotrends has named the firms in its database follows a logic I cannot know. The slightly sketchy-looking website sharesoutstandinghistory.com (Figure 3) it is, then.


Using the pandas function read_html(), which we feed the respective URL for each firm by plugging in the correct ticker, we get all the tables read_html() can find in the website’s source code. By trial and error, I figured out that table number two would always be the one containing the right data for this website.
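A sketch of that query, assuming “table number two” corresponds to index 1 in the list pandas returns:

import pandas as pd

ticker = "AAPL"
url = "https://www.sharesoutstandinghistory.com/?symbol=" + ticker.lower()

# read_html() returns every table it finds in the page's source code;
# by trial and error, the second one held the shares-outstanding history
tables = pd.read_html(url)
shares = tables[1]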

As you can see in the snapshot above (Figure 3), the data in this table is not formatted like the ones in Figure 1 and Figure 2. To convert the dates from month-day-year to year-month-day, two lines of code do the trick:
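The original two lines were only visible as a screenshot; something along these lines does the same job, assuming the dates sit in the first column of the scraped table and the columns carry numeric labels:

# parse the month-day-year text into datetimes, then re-emit as year-month-day
shares[0] = pd.to_datetime(shares[0])
shares[0] = shares[0].dt.strftime("%Y-%m-%d")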

Also, the website displays the values using B for billion and M for million, as well as decimal points. Let’s fix this quickly:
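Again a sketch, since the original snippet was a screenshot: a hypothetical helper that turns strings like “4.2B” or “950.5M” into plain numbers, applied to the (assumed) second column.

# hypothetical helper: "4.2B" -> 4200000000.0, "950.5M" -> 950500000.0
def parse_count(text):
    multiplier = {"B": 1e9, "M": 1e6}[text[-1]]
    return float(text[:-1]) * multiplier

shares[1] = shares[1].apply(parse_count)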

Done.


Really Computing Key Performance Indicators


The last thing for today will be the actual computation of the KPIs. It is crazy to think how long this episode has become from me trying to summarize the tedious work, just to finish it off with seven lines of code like this:
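The seven lines themselves were a screenshot in the original post, so here is a sketch of the arithmetic with made-up figures and my own variable names:

# KPI arithmetic with made-up figures (all names and numbers are illustrative)
net_income, total_assets, total_equity = 10e9, 300e9, 90e9
current_assets, current_liabilities, total_liabilities = 140e9, 100e9, 210e9
sales_2017, sales_2016 = 230e9, 215e9

roa = net_income / total_assets                       # Return on Assets
roe = net_income / total_equity                       # Return on Equity
current_ratio = current_assets / current_liabilities  # Current Ratio
debt_to_equity = total_liabilities / total_equity     # Debt-to-Equity Ratio
sales_growth = sales_2017 / sales_2016                # Sales Growth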

And:
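Again as a sketch, building on the snippet above: the price-based ratios plug in the share count from the scraping step and the stock price from the last episode (both made up here).

# price-based ratios (made-up share count and price)
shares_outstanding, price = 4.6e9, 75.0

eps = net_income / shares_outstanding                        # Earnings per Share
price_to_earnings = price / eps                              # P/E
price_to_book = price / (total_equity / shares_outstanding)  # P/B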

Yep, that’s it - for now.

The ratios and indicators I selected surely do not capture everything interesting about the firms. If somebody in the readership has a favorite KPI (I assume there are accountants who have one) that she or he wants to see considered, please hit me up. Since this is going to be shared on LinkedIn, let me end the episode in this way:

I am humbled and proud to announce that I am grateful and honored by your feedback (also, I am still looking for a summer internship… #GoldmanSachs #JPMorgan #BerlinerSparkasse).

Seriously, thank you again for reading and reaching out to me, and see you soon for episode 3 of A Quant’s Journey!

Stay tuned and, most importantly, stay healthy.

Friedrich