in Economics

Using Zillabyte to Anticipate Changes in Precious Metals Prices

Big Data is transforming economics and finance. Mining large datasets culled from proprietary social networks like Twitter and Facebook enables researchers, professional economists, and traders to identify hot stocks by tracking trends in word and ticker frequencies.

While these early developments look promising, their potential is limited for a number of reasons.

First, the insights from this analysis are limited due to the narrow scope of the data (short messages and posts from two networks).

Second, the brevity of the messages/posts limits their potential informational content. Third, much of the valuable content within tweets and posts are in the form of links back to blogs and traditional webpages residing on the open Internet.

Until now, those looking to mine the open Internet faced two bad choices: 1) search Google and do A LOT of copying and pasting or 2) learn how to program a custom web crawler in addition to mastering a big data platform like Hadoop, then crawl, extract, clean and store the web’s data, then do long and slow batch processing to extract your results.

If you weren’t willing to become the data mining equivalent of a sweatshop worker or a rocket scientist, you were out of luck.

Zillabyte overcomes these issues because it brings the entire informational content of the open Internet to you by way of a very intuitive and easy to use API.

With a few lines of Ruby code, the information that moves the world’s financial markets (blogs, breaking news, press releases, etc.) is available at your fingertips to be analyzed. To show you how ridiculously simple this is, I’ll walk you through a short Zillabyte app, written in the Ruby language.

Suppose you are a commodities trader at the Chicago Mercantile Exchange and want to detect changes in the world’s information set as it pertains to metals traded on your exchange – i.e. aluminum, coal, copper, gold, iron, palladium, platinum, silver, steel, and uranium.

Without Zillabyte, this task would be every financial intern’s worst nightmare. But, with the following app, we will extract every webpage in the world containing these key words in just a few minutes.

Zillabyte App 1: Finding metals

require 'zillabyte'
require 'uri'

# Create our Chicago Board of Exchange Metal Counting 'app'
app ="cboe_metals")

pages = app.source("homepages")

metals = ["gold", "silver", "copper", "platinum", "palladium", "steel", "uranium", "iron", "coal", "aluminium"]
metal_stream_names ={|m| "#{m}_stream"}

metal_streams = pages.each do 
  name "all_metals"
  # Stream names for each metal
  emits *metal_stream_names

  execute do |page|
    # This is called on every web page
    url, html = page['url'], page['html'].downcase
    domain = URI(url).host.downcase
    metals.each do |metal|
      metal_count = html.scan(metal).count
      if metal_count >= 1 # threshold for report 
        emit "#{metal}_stream", :url => page['url'], :domain => domain, :word_frequency => metal_count

metal_streams.each_with_index do |metal_stream, index|
  metal_stream.sink do
    name "has_#{metals[index]}" #Name of the relation we store as in the relational database
    column "url", :string
    column "domain", :string
    column "word_frequency", :integer

This program is similar to the Hello World in the Zillabyte docs, but performs the same task for the different metals listed. In addition, it also stores the [ruby]word_frequency[/ruby] along with the url in an appropriate sink. The diagram below illustrates the key components, along with their few crucial lines of code. In this app, several relations are produced, one for each metal in our analysis.


Run this app by pushing it up to our cluster using the command

zillabyte push

After running the app, the results can be further analyzed from the command line. For example, from the terminal, the following Zillabyte command returns the number of records (webpages) that contain the word ‘gold’:
(this command is deprecated)

zillabyte sql “SELECT COUNT(*) FROM has_gold”

In most cases, however, it’s easier to export your results to your local machine. Zillabyte makes this easy – just type the following command into your terminal:

zillabyte relations:pull has_gold_sink /path_to_your_dir/golddata.gz

By default, the extracted data is a comma-delimited text file (stored within a compressed .gz file). The uncompressed text files are easy to import into a relational database (e.g. MySQL, PostgreSQL) or spreadsheet program like Excel (if the file size isn’t too large). For this post, I exported each metal’s data relation as a separate file, and uploaded each into their own data table in PostgreSQL.

Basic exploratory analysis reveals some interesting features of the data (see Table 1).

First, the probability that any of our commodities appear at least once on randomly selected webpage is fairly low, ranging from 0.08% for uranium to 9.63% for iron. However, if one of these words appears on a webpage, that word will appear (on average) about four times on that page (see column three).

Interestingly, this statistic is fairly constant across all of our metals. Finally, we can calculate the lexicographic frequency of our metals relative to a benchmark. Choosing gold (which had the overall highest occurrence count) as the benchmark, the relative word frequencies are provided in column four.

To give perspective, I used Google’s Ngram Viewer to determine the relative frequency with which these words appeared in print in English language books in the year 2000.

Note that the relative frequencies are very similar. For example, the word ‘iron’ occurred 83% as often as gold in the open web compared to 85% as often within books.

However, that said, there were some notable differences. Specifically, other precious metals (i.e. silver, platinum, and palladium) occurred with a noticeably higher relative frequency in the web data as compared to the book data.

Table 1: Frequency relative to gold

wordword probabilityAvg. occurrences when presentZillabyte

This discrepancy in word frequency reveals a major problem: the web contains many jewelry retails, and therefore precious metals are likely to occur with a higher frequency than in books. The extent of this problem becomes more obvious when we run inner join queries on URLs across the metal-specific database tables.

Normalizing these shared page counts by the number pages that contain a given metal by itself yields very interesting conditional probabilities (see Table 2).

Table 2: Correlation of occurrence

Rows: These terms are present Column: Probability that these other words will appear

In Table 2, the values refer to the probability that the words in the rows will occur at least once on a page if the words in the columns already appear on that page. Examining the intersection of the gold row and the palladium, platinum, and silver columns, it is clear that the probability that gold appears if any of these other precious metals also appear on a page is quite high (39% to 46%).

Compare this to the 6.5% unconditional probability that gold appears on a randomly selected page. In other words, the mention of a precious metal on a given page implies that gold is six to seven times more likely to appear (6x to 7x).

The presence of gold and precious metal online retailers introduce unwanted ‘noise’ into our page count statistics. To reduce that noise, and focus our results on bona fide metal-related pages, I constructed a modified Zillabyte app that both filters out obviously bad results, and subsequently scores the remaining page hits based on economic contextual quality.

Program 2

require 'zillabyte'
require 'json'

app ="cboe_gold")

pages = app.source("homepages")

terms = ["bracelet","cash","coin","jewelry","necklace","olympic","pawn","ring","watch"]
@r ='|'))

filtered_url_stream = pages.each do |page|
  url, html = page['url'], page['html'].downcase
  if html.scan(/gold/).count > 0 and html.scan(@r).count == 0
    emit :url => url, :html => html

gold_index_stream, gold_terms_count_stream = filtered_url_stream.each do 

  emits "gold_index", "gold_terms_count"
  prepare do
    log "preparing filter terms"
    filter_terms_map = JSON.parse('filter_terms.json'))
    @regex_map = []
    # Create the regex objects once per category
    filter_terms_map.each do |cat|
      filter = {}
      filter[:category] = cat["category"]
      filter[:regex] = cat["words"].join('|')
      @regex_map << filter
    @num_categories = filter_terms_map.count

  execute do |page|
    url, html = page['page'], page['html']  
    #Iterate through categories, generating count for each.
    all_cat_count = 0
    @regex_map.each do |cat_regex|
      category = cat_regex[:category]
      reg = cat_regex[:regex]
      term_count = html.scan(reg).count
      if term_count > 0
        emit "gold_terms_count", :url => url, :category => category, :term_count => term_count
      all_cat_count += [term_count, 1].min
    index = (100*(all_cat_count/@num_categories)).round(0).to_i
    emit "gold_index", :url => url, :index => index

gold_index_stream.sink do
  name "gold_index"
  column "url", :string
  column "word_frequency", :integer
  column "index", :integer #This could be a float number too

gold_terms_count_stream.sink do
  name "gold_terms_count"
  column "url", :string
  column "category", :string
  column "term_count", :integer


For the purpose of brevity, the scope of this follow-up program is limited to gold. However, these techniques could be applied to each of the remaining metals. This program differs from the first in a few key respects.

We first filter out any web page containing terms associated with the other context of gold. This should exclude jewelry retailers, pawn shops, cash-for-gold operations, and sites dedicated to the Olympics (i.e. we want gold metals not gold medals).

The next component processes the prospect webpages, calculating two results: a gold index, and a gold terms count, each of which will be sunk and stored in its own relation. The search terms contained within this program are designed to detect key words and phrases related to economic fundamentals.

The specific terms were selected based on personal experience and intuition, but could be substituted with words or phrases (ngrams) derived from word frequency research on handpicked gold-related pages (i.e. vis-à-vis supervised machine learning). The program imports these search terms, which are stored in a JSON formatted file called ‘filter_terms.json’.

Table 3 summarizes the search terms and categories defined in the JSON file. Please note that the words in column 3 are not misspelled, they are the roots/stems of the words of interest (e.g. searching for ‘commodit’ will capture both ‘commodity’ and ‘commodities’).

Table 3: Filter categories

bubbleResearch suggests that the word 
frequency of the word 'bubble' is
correlated with actual financial
bubbles (and their bursting)
commodityPages of interest are likely to 
discuss metals from a 
commodity-centric point of view
commodit, metal, bullion, ore,
environmentEnvironmental policy can
significantly affect gold prices
environment, climate change,
global warming, co2, pollut
exchangePages discussing news/changes in
gold markets
buy, sell, trade, exchange, contract,
transaction, settlement, volume
forecastPages discussing consensus forecastsforecast, predict, estimate, consensus
governmentPages discussing government 
policy and their impact on gold
monetary, fiscal, tax, regulation, 
policy, policies, politic, ban
householdPages discussing household 
fundamentals and their impact on 
demand for gold
income, preferences, wealth
innovationPages discussing innovative
 changes related to gold
new, game chang, revolutionary, 
international tradePages discussing international
 trade and gold
import, export, tariff, quota
investingPages discussing gold investmentinvest, portfolio, speculat, holding
laborPages discussing labor conditions
and gold
wage, labor, union, strike, lockout
market conditionsPages discussing gold market conditionssupply, demand, shortage, surplus
miningPages discussing gold miningmine, mining
pricePages discussing gold pricesprice
productionPages discussing gold productionprofit, loss, revenue, earning, production, output, technology, extraction
quantityPages discussing gold quantitiesquantity, troy oz, troy ounces
riskPages discussing gold and market riskrisk, volatil, bear, bull, trend

The total keyword matches for each category are tabulated and emitted. If the webpage contains keywords in the ‘household’ category, then the ‘gold_terms_count_stream’ will emit the hash containing the URL, category, and term count for the ‘household’ category on that page. An index of economic relevance is constructed to capture the proportion of the economic categories represented on a given web page.

Looking at the actual results, the retail/pawn/cash-for-gold filtering clauses in the new program reduced the gold-related record count by 6.2%.

This is a big improvement, and careful examination of the updated results can help guide additional revisions to the ‘junk’ filter (e.g. the new results contain numerous gambling-related websites which could be easily filtered).

Because the new program also assigns an index score to each website based on the breadth of topics covered, it’s possible to make quality judgments about the data. Figure 1 provides a histogram of the data’s index scores.

Figure 1

Setting an appropriate threshold index score is a highly effective technique to further remove unwanted/poor quality results. For example, results that have key word coverage in fewer than four economic categories have a corresponding index score of 18 (= 100 x (3/17)) or less. Collectively, these low-quality results represent 30% of the updated results. Removing them yields a high quality gold-related dataset that is prime for additional data mining.

As you can see, Zillabyte is a great tool for extracting useful information from the Internet. Unlike other APIs, you are not limited to extracting data from proprietary networks (e.g. Twitter) or a fixed number of user-defined websites (e.g. If you want to mine the Internet for economic data, Zillabyte is the best platform available.

Write a Comment