FundApps provides a service to automate shareholding disclosure.
This includes software used to capture regulatory rules (legal information is provided via aosphere, and the rules are created and managed by our dedicated rule team). However, this is not the end of the story. There is an underestimated component of true automation, and it applies whether you manage shareholding disclosure using Rapptr or an in-house system.
In short, keeping on top of constantly changing regulatory data is a significant task.
My job is to source and interpret regulatory data, transforming unstructured text into a machine-readable format. Some sources are simple. For example, the nice people at the UK Takeover Panel provide an XML file which is pretty much perfect.
Unfortunately, most regulatory pages are much more difficult to scrape. One of our favourites is the HK Takeovers & Mergers list!
Step 1 - Find the best source of information:
After a bit of Googling we’re off to a good start: the Hong Kong Securities and Futures Commission publishes its active takeovers and mergers in an HTML table containing the data we need:
Real-world data changes regularly and is rarely perfectly formatted. On one hand you want a scraper to be fairly permissive; on the other, you don't want to be reading garbage if the structure gets completely changed. Given that we're going to be using this information for very important purposes, we make the scrapers confirm that they have 100% confidence in their source before trusting it. This is the art of writing a good scraper.
Using the example of the Hong Kong page, rather than looking at only the new offers, the table under the heading "Current offer periods..." provides a more complete source of information:
Changes that we might see to this page in the future and want to consider handling are:
- New columns being added
- Columns in the table being renamed or removed
- The date being written in a different format
- An offer having multiple offerors
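To make the trade-off concrete, here is a minimal Python sketch of a parser that tolerates harmless changes (extra or reordered columns, a few known date formats) but refuses to continue when something it relies on is missing. It is an illustration, not our production scraper: the column names, the accepted date formats, the ";" separator for multiple offerors, and the names ScrapeError, parse_date and parse_offers are all assumptions, and the table is assumed to have already been extracted from the HTML into a header row and data rows.

```python
from datetime import date, datetime

# Hypothetical column headings; the real headings on the page may differ.
REQUIRED_COLUMNS = ("Offeree", "Offeror", "Date offer period commenced")
# Date formats we are prepared to accept (an assumption for illustration).
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y", "%Y-%m-%d")


class ScrapeError(Exception):
    """Raised when the page no longer looks the way the scraper expects."""


def parse_date(text: str) -> date:
    # Try each known format in turn; fail loudly rather than guess.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ScrapeError(f"Unrecognised date format: {text!r}")


def parse_offers(header: list[str], rows: list[list[str]]) -> list[dict]:
    # Look columns up by name, so new or reordered columns are tolerated.
    index = {name.strip(): i for i, name in enumerate(header)}
    missing = [c for c in REQUIRED_COLUMNS if c not in index]
    if missing:
        # A renamed or removed column means we cannot trust the page.
        raise ScrapeError(f"Missing columns: {missing}")

    offers = []
    for row in rows:
        if len(row) != len(header):
            raise ScrapeError(f"Row has {len(row)} cells, expected {len(header)}")
        offers.append({
            "offeree": row[index["Offeree"]].strip(),
            # Assumption: multiple offerors are separated by ';'.
            "offerors": [o.strip() for o in row[index["Offeror"]].split(";") if o.strip()],
            "commenced": parse_date(row[index["Date offer period commenced"]]),
        })
    return offers
```

The design choice is that anything unexpected raises an error rather than producing a best guess, so a changed page results in an alert instead of bad data.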
Step 2 - Let’s find some identifiers:
Great, we’ve now got the names of the offerees and offerors, and the dates when the offer period commenced and was announced! Unfortunately, Rapptr needs a little more information than just company names. Namely, we need an identifier (an ISIN) for the securities of each issuer. Fortunately, we can get this from the HKex website, where after a bit of digging we find a page with two Excel sheets:
Step 3 - Let’s find the right identifiers:
As we look for names from the takeover offer table in the two spreadsheets, some of the names are exact matches if you simply ignore case. E.g. “ASR Logistics Holdings Limited” matches:
However, some names are written differently, for example using abbreviations. When looking for “Enviro Energy International Holdings Limited”, the row we want to match is:
To make name matching reliable but adaptable, we’ve used elements of the open-source Apache Lucene project to build the following pipeline:
- Take a piece of text
- Split it into individual words using the Lucene StandardTokenizer
- Remove ’s from the end of words and remove dots from acronyms
- Make the words lower-case
- Remove common English words such as “at”, “and”, “as” and “to”
- Normalise potential synonyms (replace “ltd” with “limited”, “hldgs” with “holdings”, etc.)
We apply this process to every name in both the table and the spreadsheets, then compare the results to find matches.
Step 4 - Making it all robust:
URLs change, and web servers can be unresponsive or go down completely. FundApps’ scraping processes therefore have to be robust enough to reschedule themselves to check the website again in a few minutes, or to look around until they’ve found the data they expect.
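As a rough illustration of the "check again in a few minutes" idea, here is a small Python retry helper. It is a sketch rather than our actual scheduler: the name fetch_with_retry, the number of attempts and the two-minute delay are all assumptions, and it only retries on network-style failures (OSError and its subclasses, such as ConnectionError and TimeoutError).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 5,             # assumed limit, for illustration
    delay_seconds: float = 120.0,  # "a few minutes" between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as error:
            # Covers connection failures and timeouts.
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    # Give up loudly so that a person can investigate.
    raise RuntimeError(f"Gave up after {attempts} attempts") from last_error
```

Passing the sleep function in as a parameter keeps the helper easy to test without actually waiting.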
Our automated scrapers run every two hours, examining the Hong Kong Takeover List and a dozen other lists (new lists are constantly being added). With everything fully automated there is no room for “missing” a takeover or copying and pasting the wrong ISIN. Now that’s what we call automation!