demand by establishing specialized big-data departments, courses, degrees
and external partnership arrangements. Some of these institutions have
nominated candidates for Canada Research Chairs in big-data analytics
research; Dalhousie’s Dr. Matwin holds one of the first such posts.
With all this interest in big-data analytics, it appears to have reached
a tipping point – or an “inflection point,” as Tamer Özsu puts it. Dr. Özsu,
a University of Waterloo professor of computer science, compares the
surge in interest to the explosion in genomics research in the early part
of this century.
Funding agencies are taking note. In the United States, says Dr. Özsu,
the Obama administration has made big-data research a priority. There’s
no comparable program here, but Canada’s three major research granting
agencies – the Natural Sciences and Engineering Research Council, the Social
Sciences and Humanities Research Council and the Canadian Institutes of
Health Research – together with the Canada Foundation for Innovation, released a consultation document asking for feedback on the components of a granting
program that would support research into the management of big data.
As the document notes, “The focus of data analysis is rapidly shifting
to embrace not simply technical development but also new ways of
thinking about social, economic and cultural expression and behaviour.
Indeed, innovative information and communications technologies are
enabling the transformation of the fabric of society itself, as data becomes
the new currency for research, education, government and commerce.”
Despite its potential, there is no generally agreed-upon definition of
big data, a catch-all phrase that seems to apply to a very broad
range of information. Wikipedia defines it as “a collection of data sets
so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications.”
Some examples of big data include the torrent of GPS signals emitted
by cell phones, the transaction records that accumulate in the servers of
companies with busy e-commerce sites, or the enormous number of keystrokes
captured from workplace computers to monitor, perhaps, employee
performance. As Wikipedia notes, managing and understanding data sets
that contain so many types of information represents an entirely different
sort of analysis from more traditional research approaches.
In fact, standard statistical tools may not generate meaningful predictions,
because samples that appear large by conventional research
standards may represent only a tiny, and not necessarily representative,
slice of the overall data set. Programmers may be able to gather and
analyze tens of thousands of tweets on Twitter, for example, but these may
account for a mere fraction of the total, limiting any generalizations drawn from the data.
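To see why sheer size does not guarantee sound conclusions, here is a toy simulation in Python (all numbers invented for illustration): a sample of 50,000 looks large by conventional standards, yet because it is drawn from an unrepresentative slice of the population, its estimate lands far from the true value.

```python
import random

random.seed(42)

# Toy population: 1,000,000 "users", 30% of whom hold some opinion.
population = [1] * 300_000 + [0] * 700_000
true_rate = sum(population) / len(population)

# A sample of 50,000 looks large, but here it is drawn only from the
# first 400,000 users, a slice that over-represents opinion-holders,
# so the estimate is badly off despite the sample size.
biased_sample = random.sample(population[:400_000], 50_000)
biased_rate = sum(biased_sample) / len(biased_sample)

print(true_rate)               # 0.3
print(round(biased_rate, 2))   # roughly 0.75, far from the true 0.30
```

The sample's size is not the problem; its lack of representativeness is, which is exactly the worry about generalizing from a convenient slice of tweets.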
Because of this, says Dr. Özsu of U of Waterloo, data experts focus
on the four Vs – volume, velocity, variety and validity – when they work
with big data. Volume refers to the amount of data; velocity to the speed
at which data arrives and must be processed; variety to the number of
types of data; and validity (or sometimes “veracity”) to the uncertainty in the data.
By definition, then, there is a great deal of data in many different
formats, and the programming tools must be capable of analyzing them
quickly and accurately. In some cases, the information may be extremely
heterogeneous – a vast soup that can include snippets of text and images
and all sorts of background noise.
Beginning to explain the patterns involves techniques to categorize the
types of data and tools to “clean” databases of extraneous information.
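As a rough illustration of what such “cleaning” can mean in practice, this small Python sketch (with made-up records, not any particular tool) drops empty, duplicate and obviously noisy entries before analysis would begin:

```python
# Toy "cleaning" pass over a heterogeneous record dump: drop empty or
# duplicate entries and strip obvious noise before any analysis.
raw = ["  sensor:ok ", "", "sensor:ok", "gps:45.4,-75.7",
       None, "###", "gps:45.4,-75.7"]

seen = set()
clean = []
for rec in raw:
    if not rec:
        continue            # skip empty strings and missing values
    rec = rec.strip()
    if not rec or rec.startswith("#") or rec in seen:
        continue            # skip noise markers and duplicates
    seen.add(rec)
    clean.append(rec)

print(clean)  # ['sensor:ok', 'gps:45.4,-75.7']
```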
Periklis Andritsos, an assistant professor at the University of Toronto’s
faculty of information, says one tool he uses measures connections
between data points that aren’t numerical in nature – for example,
the frequency with which certain words or names come up in relation
to other search phrases. These analytical methods can be used in a wide
range of contexts. “Everywhere you see data,” he says, “you see opportunities
for these applications.”
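A minimal sketch of the kind of non-numerical connection Dr. Andritsos describes is counting how often pairs of words co-occur in search phrases; the mini query log below is invented, and this is not his actual tool:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini "query log" -- invented for illustration.
queries = [
    "ottawa big data funding",
    "big data analytics jobs",
    "ottawa research funding",
    "big data analytics course",
]

# Count how often pairs of words appear together in the same query --
# a simple non-numerical measure of connection between data points.
pairs = Counter()
for q in queries:
    words = sorted(set(q.split()))
    pairs.update(combinations(words, 2))

print(pairs[("big", "data")])        # 3: co-occur in three queries
print(pairs[("funding", "ottawa")])  # 2: co-occur in two queries
```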
The world of finance is one important area. Dennis Kira, a professor
of supply chain and business technology management at Concordia
University’s John Molson School of Business, says different applications
exist for fields like credit-card fraud detection, equity trading and forensic
accounting, with systems designed to sift through billions of transactions
and look for anomalies. “That’s why the banks are really gung-ho,” says
Dr. Kira, who has taught a data-mining course for five years for finance,
marketing and management students. “It’s like looking for a needle in a
haystack,” but with the new applications “you know it when you see it.”