The Bank of Canada has recently begun collecting data on all individual notes processed by their high-speed sorting equipment, and now houses over a billion records on banknotes. This paper describes some of the challenges of big data for banknotes, and how the Bank has learned to make use of big data analytics for banknotes, believed to be valuable tools for the currency industry. Three uses of big data analytics are discussed to show how these techniques are readily adaptable to banknote data. Specifically, a correlogram analysis of wear categories, principal component analysis to simplify the data, and data mining using Association Rules to understand the relationships between types of wear are presented.
Traditional banknote circulation trials deal with thousands to millions of notes, yielding tens of millions of data points if those notes are processed on high-speed note sorting equipment. While that sounds like a lot of data to process, because of the relatively simple form of the data (integers and strings), it is still possible to use conventional computing techniques, statistical tools, and sample statistics. In fact, in 2017, the Bank of Canada published a study on their circulation trials using traditional regression techniques on single-CPU systems.
However, when the number of banknote records grows from a few million to a few billion, new tools and techniques are needed to make sense of these data. The Bank of Canada has recently begun collecting data on all individual notes processed by their high-speed sorting equipment, and now houses over a billion records on banknotes. They have had to adapt their methods to the new scale of the data, marking an important departure from the traditional tools used to examine data on circulating banknotes. Central banks are well placed to make this shift, because they already require advanced analytic capabilities to make informed policy decisions. The Bank of Canada has a rich analytic environment that includes the tools, infrastructure, and support to analyse large data sets on banknotes.
Table 1. Sample data columns (partial): Serial number | Issue date | Date of record | Tape areas sum (sq. mm) | Ink wear Front | Crease
To understand the evolution of banknote quality, the Bank of Canada requires data that track a banknote’s wear measurements over its lifetime. Previously, they relied on circulation trials that allowed them to capture data by serial number on a few million banknotes with a controlled issuance interval. More recently, they have started tracking data on all issued banknotes. For each note, they know the issue date, processed date, and sensor fitness readings for 22 wear categories (see Table 1 for partial sample data). The Bank has collected over one billion individual banknote sensor records, totalling over 400 GB of alphanumeric data. For perspective, consider that Amazon estimates the average size of a Kindle book to be around 2 kB per page. If the banknote data were to be Kindle book pages, this would equate to approximately 200 million pages.
Technical challenges of big data on banknotes
The immense volume of data presents both an opportunity and a challenge. The opportunity is to uncover hidden knowledge from previously untapped data; the challenge is that traditional data management technologies and business intelligence tools are ill-equipped to handle such volumes. Additionally, popular statistical software such as R is limited to processing data that fit within the available memory. Most personal computers are equipped with 8 to 32 GB of RAM, which sets the upper bound on the size of the data set that can be processed on a PC. In practice, memory must also be left for other processes on the computer, so as a rule of thumb the data to be processed should not exceed 50% to 60% of available memory. The Bank’s data set of 400 GB far exceeds the capacity of desktop systems, and will continue to grow as they process notes and add them to their warehouse.
To address these limitations, the Bank has turned to the open-source Spark cluster computing framework. Spark can use resources from many computer processors linked together and is a scalable solution, meaning that as more computing power or memory is needed, they can simply introduce more processors into the system.
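The memory rule of thumb and the scaling argument above can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, the 50% usable fraction is the lower end of the rule quoted above, and the worker count assumes (simplistically) that the whole data set must sit in the combined usable memory of equally sized machines, ignoring any framework overhead.

```python
import math

def usable_memory_gb(ram_gb, usable_fraction=0.5):
    """Memory left for data once other processes are allowed for."""
    return ram_gb * usable_fraction

def fits_in_memory(data_gb, ram_gb, usable_fraction=0.5):
    """True if the data set respects the rule of thumb on one machine."""
    return data_gb <= usable_memory_gb(ram_gb, usable_fraction)

def min_workers(data_gb, ram_per_worker_gb, usable_fraction=0.5):
    """Smallest number of identical workers whose combined usable
    memory could hold the data set (a deliberately rough estimate)."""
    per_worker = usable_memory_gb(ram_per_worker_gb, usable_fraction)
    return math.ceil(data_gb / per_worker)

DATA_GB = 400  # size of the banknote data set quoted in the text

# A well-equipped desktop with 32 GB of RAM.
desktop_ok = fits_in_memory(DATA_GB, ram_gb=32)

# Hypothetical cluster of 32 GB workers.
workers_needed = min_workers(DATA_GB, ram_per_worker_gb=32)

print("Fits on a 32 GB desktop:", desktop_ok)
print("32 GB workers needed (rough):", workers_needed)
```

Under these assumptions the data set does not fit on a single desktop, and the worker count grows in proportion to the data, which is the sense in which a cluster solution is scalable.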
Below, three uses of big data analytics on banknote data are discussed – not to draw attention to their conclusions, but to show how these techniques are readily adaptable to banknote data: a correlogram analysis of wear categories, principal component analysis (PCA) to simplify the data, and data mining using Association Rules to understand the relationships between types of wear.
Correlations of banknote wear
Correlation is a statistical method that can show whether, and if so how strongly, pairs of variables are related. A correlogram is a visual representation of a correlation analysis that can make relationships between large sets of variables easy to see. A correlation analysis was run on 1.1 billion sensor records to observe how banknote wear categories are related to one another, and a correlogram containing the most important correlations is presented in Figure 1. Interpretation is easy – just look for the shaded/filled-in cells, and match them up with the wear categories along the diagonal.
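A correlation over more records than fit in memory can be computed by summarising each partition of the data separately and then combining the summaries. The sketch below illustrates that idea in plain Python for a single pair of wear variables. It is not the Bank's implementation, and the partition contents and function names are made up for illustration. A cluster framework applies the same pattern across many machines.

```python
import math
from functools import reduce

def partial_sums(rows):
    """Summarise one partition of (x, y) readings as six running sums."""
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)

def merge(a, b):
    """Combine the summaries of two partitions by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))

def pearson_from_sums(sums):
    """Pearson correlation coefficient from the combined sums."""
    n, sx, sy, sxx, syy, sxy = sums
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sensor readings for two wear categories, split in two partitions.
partition_a = [(0, 0), (2, 1), (4, 3)]
partition_b = [(1, 0), (3, 2), (5, 4)]

combined = reduce(merge, (partial_sums(p) for p in (partition_a, partition_b)))
r = pearson_from_sums(combined)
print("correlation:", round(r, 3))
```

Only the six sums per partition need to be moved between machines, never the raw records. The sum-of-squares formula used here can lose precision on very large or poorly scaled values, so a production system would normally use a numerically more stable update.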
For example, as one might expect, tears and tape are related, as are creasing and foil scratches, and creasing and ink wear. However, the relationships aren’t perfect (for example, there could be banknotes with tears and no tape and vice versa). Although these correlations seem obvious, some aren’t. For example, folded corners correlate with graffiti on the window area of a banknote. At first glance, this correlation seems spurious, but further investigation showed that a folded corner covering the window of a note is detected as graffiti in the clear window by the sensor.
While this visual tool helps to see the relationships between pairwise wear categories, it doesn’t show how groups of pairwise correlations relate to one another, and it is still overwhelming to see the structure of the relationships among a large set of variables. For example, the 22 variables (not all captured in Figure 1) represent 231 pairwise correlations. To this end, principal component analysis (PCA) helps group variables together and is used as a data simplification tool.
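The count of 231 follows from the number of distinct unordered pairs among n variables, evaluated at n = 22:

```latex
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{22}{2} = \frac{22 \times 21}{2} = 231
```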
Principal component analysis (PCA)
PCA is a data compression technique that replaces a larger number of correlated variables with a smaller number of uncorrelated variables, simplifying the data set on hand. Unlike correlation, it allows for relationships among sets of variables, rather than just pairs. PCA is generally not an end in itself, as the resulting components are often inputs into other models such as regression, classification, and clustering. The use of PCA was intended to evaluate whether 22 correlated note wear variables could be transformed into 5 to 10 uncorrelated variables that retain as much original information as possible.
Figure 2 illustrates the structure of the variables found by the analysis: the wear data can be explained by 8 new variables (principal components), a reduction from the 22 original variables. Each principal component is a scaled composite variable, formed as a weighted average of the original variables. For example, the index variable RC1 is made up of the original variables Maximum Closed Tears, Sum of Closed Tears, and Tape, with weights 0.9, 0.9, and 0.6 respectively. The 8 principal components capture 60% of the variation in the original fitness variables. Although this leaves 40% of the variability unexplained, the results suggest that there might be some merit in combining redundant variables when analysing banknote quality data, which could simplify some downstream analytic activities.
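To illustrate what a principal component and its share of explained variation are, the sketch below extracts the leading component of a small correlation matrix by power iteration. The 3x3 matrix is a made-up toy, not the Bank's data, and the function names are hypothetical. The sketch also covers only plain, unrotated PCA, whereas a label such as RC1 suggests the components in the study may have been rotated, which is not reproduced here.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def leading_component(corr, iterations=200):
    """Leading eigenvalue and unit eigenvector of a symmetric
    correlation matrix, found by power iteration."""
    n = len(corr)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = mat_vec(corr, vec)
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue for the unit vector.
    eigenvalue = sum(v * p for v, p in zip(vec, mat_vec(corr, vec)))
    return eigenvalue, vec

# Toy correlation matrix: variables 0 and 1 are strongly related,
# variable 2 is nearly independent of both.
TOY_CORR = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

eigenvalue, loadings = leading_component(TOY_CORR)

# For a correlation matrix the total variance equals the number of variables.
explained_share = eigenvalue / len(TOY_CORR)

print("loadings:", [round(w, 3) for w in loadings])
print("share of variation explained:", round(explained_share, 3))
```

The two correlated variables receive the largest weights in the component, which is the sense in which PCA groups redundant variables into one composite.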
While relationships between variables can be captured by correlations and PCA, they are susceptible to spurious relationships when there are dominant wear categories that occur very frequently. To limit spurious relationships, and to understand probabilities of occurrence for wear categories, Association Rules can be used.
Association Rules (Market Basket Analysis) algorithm
Association rule analysis is used to search for interesting connections among a very large number of variables; these include objects or attributes that frequently occur together. There are many use cases of association rules mining in various industries: products that are often bought together during a shopping session, queries that tend to occur together during a session on a website’s search engine, etc. Human beings are capable of such insights intuitively, but it often takes expert-level knowledge or a great deal of experience to do what a rule-learning algorithm can do in minutes.
Banknote data are simply too large and complex for a human being to ‘find the needle in the haystack’, so association rules can be used with banknote wear variables to identify banknote wear categories that frequently occur together. The algorithm allows mining for rules such as ‘if X, then Y’, meaning that every time a set of banknote wear categories (X) occurs in a banknote (this is the Antecedent), a second set (Y, the Consequent) is also expected with a given confidence. The authors used the Frequent Pattern (FP) growth algorithm with Spark to compute the association rules. Before moving on to the results of the Association Rules algorithm, three ratios are defined that are very important in interpreting the association rules:
- support: the frequency at which the combination of wear categories occurs in the data;
- confidence: the probability that the rule holds for a newly processed banknote, that is, that the consequent is present when the antecedent is present;
- lift: the ratio of the confidence of a rule to the expected confidence.
Please note that if Lift is greater than 1, this implies that the wear categories are found together more often than one would expect by chance. A large Lift value is therefore a strong indicator that a rule is important, and reflects a true connection between the wear categories.
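The three ratios defined above can be computed directly from their definitions on a small example. The sketch below uses ten made-up notes and hypothetical function names. It counts occurrences by brute force rather than using the FP-growth algorithm, so it shows only what the metrics mean, not how they are mined at scale.

```python
def support(transactions, items):
    """Fraction of notes on which every wear category in `items` occurs."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    # Expected confidence is simply how often the consequent occurs overall.
    lift = confidence / support(transactions, consequent)
    return both, confidence, lift

# Ten hypothetical notes, each described by the wear categories detected on it.
NOTES = [
    {"foil_scratch", "missing_foil", "opac_wear"},
    {"foil_scratch", "opac_wear"},
    {"opac_wear"},
    {"creasing", "foil_scratch", "missing_foil", "opac_wear"},
    {"closed_tears", "tape"},
    {"opac_wear"},
    {"creasing", "opac_wear"},
    {"foil_scratch", "missing_foil"},
    {"opac_wear", "graffiti_front"},
    {"opac_wear"},
]

rule_support, rule_confidence, rule_lift = rule_metrics(
    NOTES, antecedent={"foil_scratch"}, consequent={"missing_foil"}
)
print("support:", rule_support)
print("confidence:", round(rule_confidence, 3))
print("lift:", round(rule_lift, 3))
```

In this toy data the lift is well above 1, so the two categories occur together more often than their individual frequencies alone would suggest.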
Figure 3 shows the significant association rules, those with a minimum of 0.5% support and at least 50% confidence, as a force-directed network graph. Rules involving foil scratches, missing foil, creasing, and opacification wear are prevalent throughout the network, as seen by the high degree centrality of these nodes. Degree centrality identifies the most important nodes within a graph, as defined by the number of links incident upon a node. This shows that the influential wear categories (foil scratches, missing foil, creasing, and opacification) are highly likely to appear on a banknote in combination with other defects. Table 2 lists the support, confidence, and lift metrics for the top association rules of interest.
| Antecedent | Consequent | Support (%) | Confidence (%) | Lift |
|---|---|---|---|---|
| Foil Scratch | Opac Wear | 26.78 | 99 | 1.15 |
| Foil Scratch | Missing Foil | 25.18 | 93 | 3.6 |
| Missing Foil | Foil Scratch | 25.18 | 98 | 3.61 |
| Foil Scratch, Missing | Opac Wear | 24.84 | 99 | 1.15 |
| Foil Scratch, Opac Wear | Missing Foil | 24.84 | 93 | 3.6 |
| Missing Foil | Opac Wear | 24.84 | 96 | 1.12 |
| Graffiti Window | Opac Wear | 16.17 | 98 | 1.14 |
| Closed Tears | Opac Wear | 14.28 | 98 | 1.14 |
| Graffiti Front | Opac Wear | 9.42 | 94 | 1.1 |
| Graffiti Back | Opac Wear | 6.98 | 70 | 2.58 |
| Creasing, Foil Scratch | Opac Wear | 6.19 | 99 | 2.64 |
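The degree centrality used to read the rule network can be sketched from a list of rules. The example below treats each wear category as a node and links every antecedent item to its consequent. How the published graph was actually constructed is not stated, so this linking scheme and the function name are assumptions. The rules are taken from the table above, and the row whose antecedent appears truncated is left out.

```python
from collections import defaultdict

# (antecedent items, consequent) pairs taken from the table of top rules.
RULES = [
    (("Foil Scratch",), "Opac Wear"),
    (("Foil Scratch",), "Missing Foil"),
    (("Missing Foil",), "Foil Scratch"),
    (("Foil Scratch", "Opac Wear"), "Missing Foil"),
    (("Missing Foil",), "Opac Wear"),
    (("Graffiti Window",), "Opac Wear"),
    (("Closed Tears",), "Opac Wear"),
    (("Graffiti Front",), "Opac Wear"),
    (("Graffiti Back",), "Opac Wear"),
    (("Creasing", "Foil Scratch"), "Opac Wear"),
]

def degree_centrality(rules):
    """Normalised degree centrality of each wear category in an
    undirected graph that links antecedent items to consequents."""
    neighbours = defaultdict(set)
    for antecedent, consequent in rules:
        for item in antecedent:
            if item != consequent:
                neighbours[item].add(consequent)
                neighbours[consequent].add(item)
    n = len(neighbours)
    # Divide by the maximum possible number of neighbours (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

centrality = degree_centrality(RULES)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.2f}")
```

With only these top rules, opacification wear is linked to every other category, which matches its frequent appearance as a consequent.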
It is interesting to note the number of rules with Opacification Wear as the consequent. Further investigation showed that there is some level of opacification wear on 85% of the notes processed, so one would expect opacification wear to be a consequent for many rules. This might be due to the limitations of the sensor, as it has a high probability of recording false positive opacification. Other rules include foil scratches leading to missing foil, creasing leading to foil scratches, and creasing leading to missing foil. Foil scratches and missing foil are detected together on 25% of notes processed (support). When foil scratches are detected on a note, there is a 93% probability (confidence) that there is also missing foil on the same banknote. Creasing and foil scratches are detected together on 6% of notes processed. When creasing is detected on a note, there is a 70% probability that there are also foil scratches on the same banknote.
Conclusions and next steps
The Bank of Canada’s large banknote data set (more than 1 billion records) can be analysed using big data tools and techniques, and applying these tools can yield new and interesting insights. The data set will keep growing as they process more notes on their high-speed sorting machines. They now have a scalable solution (Spark) that will allow them to work with these data in their entirety. As polymer notes continue to age in Canada, the Bank will continue to make use of big data techniques to look for new and valuable insights.
The Bank of Canada recently expanded its mandate to maintain quality notes in circulation. They believe that continuing their path of understanding and applying analytic techniques holds the best possibility of yielding actionable insights in support of this mandate.
The authors would like to thank Douglas Shorkey and Anastasija Kokanovic for their support and contributions related to setting up the analytic environment and data modelling. They would also like to thank Ted Garanzotis for his assistance in editing.
- Malmberg, S., Graaskamp, L., and Balodis, E. (2017). Banknote wear in Canada: Comparing paper and polymer substrates for a whole series transition. Keesing Journal of Documents & Identity, Vol. 53, pp. 3-8.
- Kyrnin, J. (2018). The Right File Sizes for Kindle Books. Lifewire. [Accessed 26 Nov. 2018].
- Walkowiak, S. (2016). Big Data Analytics with R. Packt Publishing, Birmingham.
- Lantz, B. (2013). Machine Learning with R. Packt Publishing, Birmingham.