Identity theft, fraud and related crimes are increasingly common occurrences. The Center for Identity at The University of Texas (UT CID) is studying these crimes to determine the methods and resources used to carry them out, the vulnerabilities exploited, and the consequences of these incidents. The UT CID Identity Threat Assessment and Prediction (ITAP) project is building a computational model from more than 5,000 national and international identity incidents, and counting, and applies tools to analyse these threats, losses, and trends, with notable and sometimes surprising results. ITAP also provides some guidance on how to avoid these crimes.

Identity theft, fraud, abuse, and exposure reportedly affected an estimated 15.4 million persons in 2016 in the United States alone[1], and these crimes have been at or near the top of the Federal Trade Commission’s national ranking of consumer complaints for 17 consecutive years. Investigating identity theft, fraud, abuse, and exposure events (hereafter ‘identity theft’ for the sake of brevity) in detail, and thoroughly understanding them, is a first and important step toward countering these problems.

Literally every day, multiple incidents of identity theft are reported in the news media. At the UT CID, the Identity Threat Assessment and Prediction (ITAP) project is gathering identity theft information from news stories, structuring this information, analysing it, and discovering trends and characteristics. While other researchers have utilised various sources of identity theft information – such as agency data, surveys, and anecdotal reports – our approach is to extract identity theft information directly from news stories.[2] (ITAP, however, is not intrinsically limited to news stories. One could, and in the future we might, use other sources of identity theft data to populate the ITAP Model.)

Novel data source

Previous work has introduced two main identity theft data sources: agency data and surveys.

  • Agency data, while comprehensive, are usually available only to law enforcement and have been criticised for lack of consistency, under-reporting, and bias due to changes in consumer awareness and agency policies.
  • Surveys, apart from variance in their sample sizes and methodologies, have been criticised for non-response bias, difficulty in contacting victims, and sole reliance on victims’ memories.

Our novel use of news stories as a data source, however, demonstrates characteristics such as volume, availability, recency, and reliability. News stories complement other sources by encompassing a wide range of identity theft stories, from victims, law enforcement, and companies. This data source has some bias too. The news media obviously tends to report stories that are considered ‘newsworthy’. In addition, similar to any data gathered independently of the final analysis, a drawback of some news stories is that not all the analytical questions can be readily answered. Consequently, we must use ‘varies’ or ‘unknown’ when the answer to an analytical question is not available in the story.

Gathering methods

The ITAP project gathers media news stories on identity theft via two distinct methods. First, we set up an RSS feed to monitor several websites that report on cases of identity theft. Second, we created a Google Alert, which provides a daily notification of any new website indexed by Google that reports about identity theft. The news story webpages collected through these two methods are manually winnowed down to a list of identity theft reports. If the same identity theft incident is reported or updated multiple times, it is manually combined into one story. Most of our stories come from respected newspapers or cyber security websites.

We then add the information collected from the news stories to the ITAP Model we built with the AWAREness Suite application[3], a web-based system for modelling and quantifying data. The ITAP Model is a comprehensively structured collection of over fifty details about each identity theft incident. It includes features such as the type of incident, how and when the incident occurred, the methods and resources used by the perpetrators, the vulnerabilities exploited, the types of personal information compromised, the demographics of the victims, and the consequences for the victims and perpetrators.
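To make the idea of a structured incident entry concrete, here is a minimal sketch in Python. It is not the AWAREness Suite schema, which the article does not describe: the class name, the field names, and the handful of fields chosen (out of the fifty-plus details mentioned above) are all assumptions made for illustration. Only the ‘unknown’ and ‘varies’ placeholder values come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNKNOWN = "unknown"  # used when a story does not answer an analytical question
VARIES = "varies"    # used when the answer differs within one story


@dataclass
class IncidentRecord:
    """Hypothetical, much-reduced stand-in for one ITAP Model entry."""
    incident_type: str = UNKNOWN
    year: Optional[int] = None
    methods: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)
    vulnerabilities: List[str] = field(default_factory=list)
    pii_compromised: List[str] = field(default_factory=list)
    victim_age_groups: List[str] = field(default_factory=list)
    financial_loss_usd: Optional[float] = None  # None when not reported

    def unanswered(self) -> List[str]:
        """Return the names of fields the news story left unanswered."""
        missing = []
        for name, value in vars(self).items():
            if value is None or value == UNKNOWN or value == []:
                missing.append(name)
        return missing


# A made-up story that reports only some of the details.
record = IncidentRecord(
    incident_type="data breach",
    resources=["computer", "internet"],
    pii_compromised=["name", "social security number"],
)
print(record.unanswered())
```

Keeping unanswered fields explicit, rather than dropping them, mirrors the point made earlier that not every analytical question can be answered from a given story.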

We apply various analytical tools to the ITAP Model to reveal useful overarching statistics and trends regarding identity theft. The ITAP Dashboard is a set of tailor-made charts and tables we have developed to explore a variety of particularly interesting aspects of the identity theft incidents, such as the resources most frequently used by perpetrators and the geographic distribution of identity theft over the United States.

In summary, the ITAP project makes the following contributions: It gathers, models, and analyses a large number (currently about 5,400) of identity theft news stories. Although not necessarily limited to this data source, ITAP is the first project to use identity theft news stories in this manner. By modelling and analysing the identity theft information, ITAP uncovers various interesting features and trends in the world of identity theft, fraud, abuse, and exposure.

ITAP Dashboard analytics

The analytics provided by the ITAP Dashboard are custom-built to show interesting facts extracted from the ITAP Model regarding identity theft, identity fraud, and other cases in which personally identifiable information (PII) is compromised. In this section, we show and explain a number of these analytics. The analyses are divided into three categories, according to the aspect of the incidents to which they primarily pertain: the events themselves, the victims, or the perpetrators.


Amount of non-malicious activity: This is the percentage of incidents in ITAP in which PII is compromised, but without malicious intent on the part of those responsible. Such incidents are commonly caused by human error of some sort. Currently, the percentage is just over 17.4%.

Digital vs. non-digital theft: A theft is considered purely digital if the resources used by the perpetrator(s) include nothing other than computers (or other digital devices), the Internet (or other computer networks), and information accessible via such networks. A theft is purely analogue if it primarily involves physical actions (beyond those required to operate a digital device), for example, breaking into an office and stealing a briefcase. An example of ‘both-digital-and-analogue’ could be a case in which the perpetrator gets someone to reveal a password over the telephone via social engineering (analogue), and then uses the password on a website to access the victim’s bank account information (digital). In ITAP, 53% of the thefts were non-digital, 46% were digital, and 1% were both.
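The three-way distinction above can be sketched as a simple rule over the resources recorded for an incident. The resource labels below are hypothetical, since the article does not list the ITAP vocabulary, and the rule is one plausible reading of the definitions rather than the project's actual classifier.

```python
# Hypothetical resource labels; the real ITAP vocabulary is not given in the article.
DIGITAL_RESOURCES = {"computer", "internet", "network", "online database", "website"}


def classify_theft(resources):
    """Return 'digital', 'non-digital', 'both', or 'unknown' for the resources used."""
    used = set(resources)
    if not used:
        return "unknown"  # the story did not say which resources were used
    digital = used & DIGITAL_RESOURCES
    physical = used - DIGITAL_RESOURCES
    if digital and physical:
        return "both"
    return "digital" if digital else "non-digital"


print(classify_theft(["computer", "internet"]))   # purely digital
print(classify_theft(["briefcase"]))              # purely analogue
print(classify_theft(["telephone", "website"]))   # social engineering, then online access
```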

Figure 1: The top ten affected market sectors and their percentages.

Market sector: Here the ITAP user selects the number of most commonly affected market sectors they want the chart to display (i.e. Top 5, 10, 15, or All) to see a horizontal bar chart showing the corresponding percentages of incidents associated with those sectors. Figure 1 shows that the top ten sectors, in order, are: Consumer/Citizen, Healthcare & Public Health, Government Facilities, Education, Financial Services, Commercial Facilities, Defense Industrial Base, Information Technology, Law Enforcement, and Food & Agriculture. (The sectors we consider are the US Department of Homeland Security’s sixteen Critical Infrastructure Sectors[4] and three others we find useful: Consumer/Citizen, Education, and Law Enforcement.)

Note that 90% of all incidents fall under one of the top six sectors.

National impact of identity theft: This is the percentage of US-based events in which PII was compromised and the incident was local to a particular city (or cities), county, state, or region, as opposed to incidents that have nationwide or worldwide effects. The percentage of localised incidents is currently a very high 99.64%. Thus only 0.36% of the incidents spanned the whole of the United States, such as the infamous Target breach in 2013[5] and the Equifax breach in 2017[6].

Figure 2: Age groups of victims and their percentages.
Figure 3: Distribution of the annual incomes of victims.


Age group of victims: This bar chart shows the percentages of incidents affecting victims of different age groups. Though adults generally were the most-affected group at 71%, seniors were specifically targeted in 21% of the events (see Figure 2). Note that the primary goal of charting data based on commonly used age range vocabulary (‘child’, ‘adult’, and so on) is to assess the risks for these special age groups. As a result, the age ranges inevitably have different sizes; for example, the adult age range covers a spread of perhaps 40 years, whereas the child range covers only a few years.

Annual income of victims: This bar chart shows the percentages of incidents affecting victims in various income ranges.

Figure 4: Education levels of victims and their percentages.

Education level: This horizontal bar chart shows the percentages of incidents affecting victims of different levels of education. It turns out that the college-educated are harmed the most often (see Figure 4).

Figure 5: Types of loss incurred and their percentages.

Type of loss: This horizontal bar chart displays the percentages of incidents with respect to the types of loss incurred by the victims. Figure 5 shows, notably, that emotional distress is experienced more often than other types of loss, such as financial and property loss.

Additional note regarding Figures 2-5:

The percentages shown in these charts total more than 100%. This is because a single incident often falls into more than one category in a chart; for example, it may affect victims of more than one age group.
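A small worked example shows why such per-category percentages can exceed 100% in total. The four incidents below are made up for illustration, and the function name is an assumption; the only point is that each incident is counted once for every category it touches.

```python
from collections import Counter


def label_percentages(incidents):
    """Percentage of incidents carrying each label; an incident may carry several."""
    total = len(incidents)
    if total == 0:
        return {}
    counts = Counter()
    for labels in incidents:
        counts.update(set(labels))  # count each label once per incident
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up sample: four incidents, two of which affect more than one age group.
sample = [
    {"adult"},
    {"adult", "senior"},
    {"senior"},
    {"adult", "child"},
]
shares = label_percentages(sample)
print(shares)                # adult 75%, senior 50%, child 25%
print(sum(shares.values()))  # 150.0, i.e. more than 100%
```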

Figure 6: The top five types of perpetrators.


Performers: Here the user selects the number of most common types of performer they want to be displayed (i.e. Top 5, 10, 15, or All) to see a horizontal bar chart showing the corresponding percentages of incidents associated with those performers. Figure 6 shows the current top five performers and their respective percentages. In ITAP, briefly, a fraudster is one who misuses PII, a thief is one who steals PII, and a hacker is one who creates or exploits a digital or computer-based vulnerability in order to compromise identity assets. An employee (of the compromised entity) or a medical service provider is one who mistakenly exposes PII without malicious intent.

Figure 7: The top ten resources used by perpetrators.

Resources: The user selects the number of most commonly used resources they want to be displayed (i.e. Top 5, 10, 15, or All) to see a pie chart showing the corresponding percentages of incidents associated with those resources. (The percentages are normalised to total 100% regardless of the number of resources shown.) Figure 7 shows the top ten resources used.
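The normalisation mentioned in parentheses can be sketched as follows: keep the N most frequent items, then rescale their shares so that the displayed slices total 100%. The function name and the incident counts are assumptions for illustration and are not taken from the ITAP Dashboard.

```python
from collections import Counter


def top_n_normalised(counts, n=None):
    """Keep the n most frequent items and rescale their shares to total 100%."""
    ranked = Counter(counts).most_common(n)  # n=None keeps every item
    shown_total = sum(count for _, count in ranked)
    if shown_total == 0:
        return []
    return [(name, 100.0 * count / shown_total) for name, count in ranked]


# Made-up incident counts per resource, for illustration only.
resource_counts = {"computer": 50, "paper documents": 30, "telephone": 15, "vehicle": 5}
top_two = top_n_normalised(resource_counts, 2)
print(top_two)  # shares are relative to the two items shown, not to all four
```

One consequence of this design is that the percentage shown for an item changes with the selected N, so pie charts for different N values are not directly comparable.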

Insider vs. outsider activities: This statistic shows the respective percentages of incidents in which the perpetrator(s) were insiders, outsiders, or both insiders and outsiders. Insiders include employees of companies and family members of individuals. About 34% of the events were performed solely by insiders, 62% solely by outsiders, and 4% by both.

Figure 8: The top five types of PII compromised.

PII compromised: The user selects the number of most commonly compromised types of PII they want to be displayed (i.e. Top 5, 10, 15, or All) to see a pie chart showing the corresponding percentages of incidents that have those PII types associated with them. (Here again, the percentages are normalised to total 100% regardless of the number of PII types shown.) Figure 8 shows that the top five compromised PII types are: name, social security number, date of birth, address, and credit card information.

Financial loss per attribute: The user selects an item from a list of PII types and other personal attributes to see a dollar amount representing the average financial loss associated with the selected attribute. The amount is calculated as the mean amount of money lost in incidents in which that attribute was compromised. Note that the money lost is not averaged per victim, as the number of victims is unknown in many cases. Figure 9 shows the loss amounts for the top five frequently used personal attributes.
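The per-attribute average can be sketched as a mean over incidents rather than over victims. The record layout and the dollar amounts below are invented for illustration, and skipping incidents with no reported loss is an assumption: the article does not say how such incidents are handled.

```python
from statistics import mean


def mean_loss_per_attribute(incidents, attribute):
    """Mean loss over incidents that compromised `attribute` and report a loss."""
    losses = [
        inc["loss_usd"]
        for inc in incidents
        # Assumption: incidents with no reported loss are left out of the mean.
        if attribute in inc["pii"] and inc["loss_usd"] is not None
    ]
    return mean(losses) if losses else None


# Made-up incidents; amounts are per incident, not per victim.
incidents = [
    {"pii": {"name", "social security number"}, "loss_usd": 1000.0},
    {"pii": {"name"}, "loss_usd": 3000.0},
    {"pii": {"social security number"}, "loss_usd": None},
    {"pii": {"address"}, "loss_usd": 500.0},
]
print(mean_loss_per_attribute(incidents, "name"))
print(mean_loss_per_attribute(incidents, "date of birth"))  # no matching incident
```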


Other authors:

Razieh Nokhbeh Zaeem received her Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin in 2014. As a Google Anita Borg Scholarship Finalist in 2010, she interned at Rockwell Automation Inc. and at Fujitsu Laboratories of America. She is a Research Associate at the Center for Identity and has published in prestigious journals and conferences on a broad range of topics from automated software engineering and data mining to privacy concerns and identity protection.

James Zaiss is currently a Research Scientist at the Center for Identity, University of Texas at Austin, where he oversees, among other projects, the Identity Threat Assessment and Prediction (ITAP) project. Prior to that, he was a Senior Product Manager at AWARE Software, Inc. and a Senior Ontologist at Cycorp, Inc. Jim taught philosophy for 10 years as an assistant professor and holds a Ph.D. in Philosophy from the University of California, Irvine.



The main products of the ITAP project include the continually growing ITAP Model, which consists of structured information gleaned from (currently over 5,000) news stories reporting incidents involving the exposure, theft, or fraudulent use of PII. The steps taken by the perpetrators, the resources they used, the types of PII that were compromised, and other salient attributes of the incidents, the victims, and the perpetrators are captured in the model. The ITAP Dashboard reveals novel and sometimes surprising results. For example, one third of the incidents were performed solely by insiders, and senior citizens are particularly vulnerable to identity threats.


  1. Pascual, A., Marchini, K. and Miller, S. (2017). 2017 Identity Fraud: Securing the Connected Life. [Accessed 15 January 2019].
  2. Yang, Y., Manoharan, M. and Barber, K.S. (2014). Modelling and Analysis of Identity Threat Behaviors through Text Mining of Identity Theft Stories. In 2014 IEEE Joint Intelligence and Security Informatics Conference (JISIC 2014), pp. 184-191.
  4. Critical Infrastructure Sectors. [Accessed 15 January 2019].
  5. Rosenblum, P. (2013). Target Hit By One Of Most Sophisticated Data Thefts Ever, But It Won’t Hurt The Retailer. [Accessed 15 January 2019].
  6. (2017). Breach at Equifax May Impact 143M Americans. [Accessed 15 January 2019].
Figure 9: Average losses for cases in which the top five frequently used attributes were compromised.



This research is the sole work of The University of Texas Center for Identity. Dr. Barber’s affiliation with The Department of Homeland Security does not imply DHS endorsement of the research publication herein.


Dr Suzanne Barber is the AT&T Endowed Professor in Engineering and founding director of the Center for Identity at the University of Texas. She also serves as a member of the US Department of Homeland Security’s Data Privacy and Integrity Advisory Committee.
