The application of digital forensics techniques in establishing the facts of a crime is not new. First termed ‘computer forensics’, these techniques emerged during the mid-1980s and grew in popularity throughout the late 1980s and early 1990s, with the first ever Computer Analysis and Response Team being created in the USA in 1984 (Whitcomb 2002), followed a year later by a Computer Crime Squad in London’s Metropolitan Police. With the massive growth in cyber crime in recent years, digital forensics techniques and tools have developed at pace as we race to secure our information in the online space, including in the protection of intellectual property rights, of which student work could be considered an example.
The National Institute of Standards and Technology Glossary (NIST 2021) describes digital forensics as ‘the application of computer science and investigative procedures involving the examination of digital evidence - following proper search authority, chain of custody, validation with mathematics, use of validated tools, repeatability, reporting, and possibly expert testimony’. These techniques are used in criminal investigations as a means to identify the perpetrator of, or accomplices to, a crime and their associated actions. They are sometimes used in cases relating to intellectual property to establish the legitimate ownership of a variety of objects, both written and graphical (Fu et al. 2011), as well as in fraud and forgery (Jeong and Lee 2017).
Using digital forensics in academic misconduct settings has been mooted previously by Klopper (2009), who suggests that the weakness of anti-plagiarism programs is that they function on non-semantic grounds and that they only use the surface web to access documents. Klopper discusses the disciplines of Forensic Auditing and Computer Forensics and their shortcomings, and suggests that a new interdiscipline combining Cyber Forensics and Forensic Linguistics, namely ‘Cyber Forensic Linguistics’, would help harness the tools that could counteract plagiarism. This focuses on linguistics and forensic analysis using computing techniques, linked to the surface and dark web. However, as with many previous suggestions of ‘forensics’ this approach focuses primarily on the text or language within the document, not on the document as an object in its own right.
Examples of digital forensics in law enforcement
In both cyber crime and physical crime, digital forensics is playing an increasingly important part in evidence gathering. Freeman and Llorente (2021) discuss various forms of digital evidence and their application in law, including video and audio evidence, data in the cloud and on the Internet, files on a hard drive, emails and text messages. There are examples of wearable devices being analysed to ascertain the movements of individuals involved in murder cases, for example the murders of Connie Dabate in 2015, where a Fitbit was used to disprove the husband’s story (Almogbil et al. 2020), and Caroline Crouch in 2021, where heart rate data was used to disprove the husband’s story (BBC 2021). In business, evidence of data exfiltration has been provided through network forensics analysis, proving that employees have sent data outside of the organisation, as in the case of Zhang and Apple (US Department of Justice 2018), where log files were used to prove theft of trade secrets. The presentation of such digital evidence in court follows strict regulations for admissibility to help avoid any false conclusions being drawn.
Digital forensics tools and techniques
Before establishing whether digital forensics tools and techniques have a role in detecting academic misconduct, it is useful first to summarise some of the tools available and what information they can tell us.
Reverse Engineering in digital forensics is a technique commonly used to understand how malware works, by reversing or unpacking files and programmes back to their component parts. This is often a manual process requiring an excellent understanding of programming and compiling code. Digital forensics software does a similar thing on files and devices by taking entire storage drives and separating out all the various parts (users, graphics, text, email, chat and so on). Software solutions such as AccessData’s FTK™ (Forensic Toolkit), Magnet Forensics™, EnCase™ and Autopsy™ facilitate the building of a user profile by establishing a timeline of activity across many different components including web browsing history, some chat logs, email interrogation, image searching and much more. Similar techniques can be used on single files, although they yield less comprehensive results. During evidence gathering, investigators are permitted only to analyse areas of a device that relate to the case in question, thus protecting user profiles to some degree. It is possible that these software tools could help in the detection of some cases of academic misconduct.
Alongside these software solutions, there are some standard techniques used by law enforcement and digital forensics which could also prove useful in detecting academic misconduct. For example, ‘hashing’ is a one-way cryptographic function which takes any input (e.g. a password, image or file) and produces a fixed-length message digest, effectively a fingerprint of the file, which cannot be reversed (in that it is not possible to reconstruct the original file from just the hash value) and which is, for practical purposes, unique to the file that was input. Therefore, if two images share the same hash value, those images can be treated as identical. Hash values are used to swiftly examine the images on entire computer hard drives searching for known illegal images, such as child pornography, and it is a technique which is likely to become increasingly important in identifying cases of fake news and deepfakes. Fig. 1 is an example of a hash value created from a random file on the author’s computer:
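The same fingerprinting idea can also be sketched in a few lines of code. The following is a minimal illustration only, assuming SHA-256 as the hash function (the discussion above does not name a particular algorithm) and using hypothetical function names and sample inputs.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical inputs: two identical byte strings and one altered version.
original = b"The cat sat on the mat"
exact_copy = b"The cat sat on the mat"
altered = b"The dog sat on the mat"

print(sha256_of_bytes(original) == sha256_of_bytes(exact_copy))  # identical input
print(sha256_of_bytes(original) == sha256_of_bytes(altered))     # one word changed
```

Changing a single word produces a completely different digest, which is why a hash match is strong evidence that two files are byte-for-byte copies, but says nothing about files that are merely similar.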
Another technique used in law enforcement is that of Reverse Image Lookup (RIL). RIL is a form of Content Based Information Retrieval that uses special search image methodologies based on an image’s attributes such as colour, shape and texture to search the Internet for images that match (Chutel and Sakhare 2014). Digital Forensic investigators use RIL to carry out investigations such as determining the location of crimes that have been videoed or photographed and put on social media, to help locate missing persons and in cases of identity theft. Various online tools exist for this purpose including Tin Eye, Google Images, Yandex and Bing Image Match.
These are just a few examples of digital forensics techniques that are used by law enforcement, but examining the data using these tools requires significant expertise, meticulous attention to detail, and knowing what to look for and how to find it. Applying these techniques in an academic setting would require a radical rethink of how the tools can be used. For example, none of the digital forensics software tools (FTK, Autopsy etc) readily extract the underlying metadata (e.g. XML) from a file and yet this data can provide a rich source of information in relation to document construction, which can be very useful in an academic setting.
The law’s reach into academia
Academic integrity enforcement is not typically associated with the law and criminal proceedings, yet there are an increasing number of specific cases where this is beginning to happen. Whilst most institutions will have some sort of academic integrity or misconduct policy, these tend to be focused on institutional breaches and would not usually involve the law, other than in cases relating to intellectual property theft of research-related works.
Quality assurance agencies in several countries have made essay mills (the organisations providing essay writing services to students for a fee) illegal. Australia, New Zealand, some states in the USA, and the Republic of Ireland all have legislation that criminalises some aspect of these services, and the UK announced in September 2021 it will shortly follow suit, meaning that it will become a criminal offence to provide, arrange or advertise cheating services for financial gain (UK Government 2021). This clear intention to use legal force to prevent contract cheating, or outsourcing of assessments, is an example of how legal processes are seeping into educational settings.
Furthermore, the UK Quality Assurance Agency (2021) has done a great deal of work in supporting academic institutions to prevent, detect and manage academic misconduct. In 2021 they provided advice for Higher Education Institutions on how to prevent the emerging threat of essay mills hacking into university websites to redirect students to their essay writing services. One suggestion they make is to block connections to university networks using the IP addresses of known essay mills (effectively using the online address of an essay mill to block access to students by filtering out any Internet traffic coming from these addresses). However, this relies on security methods such as IP scanning and IP blocking, which are only successful if the IP address of the Essay Mill is known and does not change regularly.
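The limitation of address-based blocking can be illustrated with a short sketch. This is not the QAA's recommended implementation; the blocklist entries below are reserved documentation addresses used as hypothetical stand-ins for essay mill addresses, and the function name is an assumption made for illustration.

```python
import ipaddress

# Hypothetical blocklist: these are reserved documentation ranges,
# standing in for addresses attributed to known essay mills.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


print(is_blocked("203.0.113.45"))  # inside a blocked range
print(is_blocked("198.51.100.8"))  # one address away from a blocked host
```

The second check shows the weakness noted above: a service that moves to an address outside the known list is no longer filtered.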
Types of academic misconduct
It is useful at this point to briefly consider types of academic misconduct and how students attempt to obfuscate their actions. Copy and paste plagiarism (where content is taken from a source without proper referencing), contract cheating (where a third party provides the work for the student, often, but not always, in return for payment) and collusion are the main forms of academic misconduct that are considered in this paper. In terms of copy and paste plagiarism and collusion, students may change minor elements of the text, perhaps substituting single words throughout a document with an alternative, or making small deletions or additions, as this will thwart text matching software. Other techniques which students have been known to use to ‘beat’ text matching software include adding white characters between words, using alternative character sets and replacing text with images of text. When using images to support the narrative, students may crop out unwanted parts of the image and sometimes these cropped out areas may provide clues to the original source of the image, such as from a social media ‘chat’ or post, or from a website. In terms of contract cheating, minor changes to the work received from the contracted author are sometimes carried out, such as adding the student’s name, or changing odd words to reflect nuances of the institution where the student is based and so on.
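One of the obfuscation techniques listed above, the use of alternative character sets, can be illustrated with a small sketch. It assumes that the substitution takes the form of look-alike letters from another script (for example a Cyrillic ‘а’ in place of a Latin ‘a’) and uses the first word of each character's Unicode name as a rough proxy for its script; the function names are hypothetical and this is one possible heuristic, not an established detection method.

```python
import unicodedata


def scripts_in_word(word: str) -> set:
    """Collect a rough script label for every letter in a word.

    The label is the first word of the Unicode character name,
    e.g. 'LATIN' or 'CYRILLIC'. This is an approximation.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_words(text: str) -> list:
    """Return the words whose letters come from more than one script."""
    return [w for w in text.split() if len(scripts_in_word(w)) > 1]


# Hypothetical sample: the 'a' in the last word is Cyrillic U+0430.
sample = "The essay contains pl\u0430giarism"
print(mixed_script_words(sample))
```

A word that mixes scripts looks normal on screen but no longer matches the original text character for character, which is how the substitution defeats simple text matching.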
Can digital forensics tools and techniques be used to detect and evidence academic misconduct?
Examples of tools and techniques used in a digital forensics setting have been discussed, but can these methods be repurposed to aid in the detection of academic misconduct? Techniques such as file hashing could potentially be used to confirm collusion, or to match an image in a student submission to an online image, but this application would seem to be of limited benefit. Reverse Image Lookup could similarly have some uses if, for example, an image in a student submission has not been referenced, in order to locate the original source. However, this too seems limited, perhaps being more relevant as a way of pursuing cases of intellectual property theft. Techniques for extracting forensic data can be useful and are already used in some institutions. The learning management system, for example, can provide information on access to the platform, engagement with resources, issues encountered during examinations and more via logs and reports. E-Proctoring is also used increasingly for examinations carried out at a learner’s place of choice, where analytical tools can be used to track the learner’s activities in an attempt to reduce the possibility of cheating. However, of all the digital forensics tools mentioned thus far, reverse engineering of student submissions would appear to be the most useful.
Reverse engineering in academic integrity
Microsoft Office file formats, along with those of a number of other software packages, use XML as their underlying language. XML stands for Extensible Markup Language and is used because it helps with file sharing, tracking changes and keeping file version histories which allow a user to revert to a previous version of the document if required. XML forms the backbone of many documents, but it is rarely examined for any other purpose.
Whilst authorship tools are beginning to extract some of this XML metadata from student submissions, they currently collect only the most basic details. Essentially these techniques dip into reverse engineering, but only in a very limited way. Didriksen (2014) carried out a forensic analysis of Office Open XML (OOXML), the underlying language of a Microsoft Word document, to establish what information can be extracted by unpacking its component parts. Didriksen notes that documents created with word processors may form part of a forensic investigation and that the XML of these files contains data that may support an investigation in a number of ways, including determining the original source of the document and detecting plagiarism. Didriksen’s work examines the underlying XML in some detail, providing a very useful starting point for this type of forensic investigation of Word documents.
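Unpacking a Word document into its component parts requires no specialist software, because a .docx file is a ZIP archive of XML parts. The sketch below is a minimal illustration using Python's standard library; the function names are hypothetical, and the tiny archive built at the end is a stand-in for a real submission, not a valid Word document.

```python
import io
import zipfile


def list_parts(docx) -> list:
    """List the parts stored inside a .docx file (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return sorted(archive.namelist())


def read_part(docx, name: str = "word/document.xml") -> str:
    """Return one part of the archive as text, by default the main body XML."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read(name).decode("utf-8")


# Hypothetical stand-in for a submission: an in-memory archive with two parts.
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as archive:
    archive.writestr("word/document.xml", "<document>body</document>")
    archive.writestr("word/settings.xml", "<settings/>")

print(list_parts(fake_docx))
print(read_part(fake_docx))
```

In a real document the listing would typically also include parts such as word/styles.xml and, where images are present, files under word/media/.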
Johnson and Davies (2020a) take this a step further by extracting the XML from a student submission known to have been written by an essay writing service. By extracting the XML they were able to examine how the document had been created and edited using Revision Identifiers or ‘rsid’ tags. rsidR tags are used to mark changes carried out within the numerous editing sessions for a document: edits carried out in a single session share the same rsidR value, and these values are randomly assigned throughout the life of the document. Documents with very few edit (rsidR) tags suggest that very little editing has been carried out, which could be indicative of contracted work that has been saved into a new document ready for submission. Examining where the edits have been carried out can also provide useful indications of contract cheating, for example if the only information in a document that has been changed is the student name, or references to a specific course or institution-related aspect. To demonstrate what XML markup looks like, a very simple document was created with the text ‘The cat sat on the mat’. The document was saved, and then the word ‘cat’ was changed to ‘dog’ and the document resaved. The XML for the paragraph containing the text appears as shown in Fig. 2:
In this example, the mark up shows where the word ‘cat’ was changed to ‘dog’ in a separate editing session (or rsidR session) to the rest of the text (highlighted in bold in the example). Here, the word ‘dog’ has an rsidR value of 006D0D43, whilst the rest of the sentence has the rsidR value 001E7189. This shows that the word ‘dog’ was added to the document at a separate time to the rest of the sentence.
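The grouping of text by editing session can be sketched programmatically. The paragraph XML below is a hand-written approximation of what Word might produce for this example, reusing the two rsidR values quoted above; it is not a copy of the figure. The sketch also assumes that a run without its own rsidR attribute belongs to the session given by the paragraph's rsidRDefault attribute, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML namespace used by the main document part.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written illustration, not the author's actual markup.
SAMPLE_PARAGRAPH = (
    f'<w:p xmlns:w="{W}" w:rsidR="001E7189" w:rsidRDefault="001E7189">'
    '<w:r><w:t xml:space="preserve">The </w:t></w:r>'
    '<w:r w:rsidR="006D0D43"><w:t>dog</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> sat on the mat</w:t></w:r>'
    '</w:p>'
)


def text_by_rsid(paragraph_xml: str) -> dict:
    """Group the text of each run under the rsidR value it carries."""
    paragraph = ET.fromstring(paragraph_xml)
    # Assumption: runs with no rsidR of their own use the paragraph default.
    default = paragraph.get(f"{{{W}}}rsidRDefault", "unknown")
    groups = defaultdict(list)
    for run in paragraph.iter(f"{{{W}}}r"):
        rsid = run.get(f"{{{W}}}rsidR", default)
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        groups[rsid].append(text)
    return dict(groups)


for rsid, pieces in text_by_rsid(SAMPLE_PARAGRAPH).items():
    print(rsid, pieces)
```

Counting the distinct keys in the result gives a rough measure of how many editing sessions touched the paragraph, which is the kind of signal discussed above.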
This XML can also help to identify the URLs of images if they have been copied from the Internet, as well as highlight where fonts and other features have been changed. This has been further explored in Johnson and Davies (2020b) by looking at things such as underlying font styles, the relative frequency of formatting tags in plagiarised works, redundant formatting tags (that possibly suggest something has been deleted or reformatted to match the default document styles), frequency of words to rsidR edits and more, in an attempt to build a series of ‘flags’ for assessors. In addition, the media files that are available when the document has been converted to its component parts show images in full, even if the document itself shows the cropped version, which can provide useful information if taken from screenshots (where information that helps to identify the original source may appear in the cropped-out area). Jeong and Lee (2017) also note that when two documents share an rsid value in the settings.xml file, this indicates that both files have originated from the same source, and suggest that these findings could be useful in fraud and forgery cases, though this would also be highly beneficial in providing evidence for allegations of collusion.
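The comparison of rsid values between two documents can be sketched as a set intersection. The example below is an illustration under stated assumptions rather than the method used by Jeong and Lee: it assumes each document's settings.xml has already been extracted as text, that session identifiers appear as w:rsid elements with a w:val attribute, and it uses hypothetical helper names and made-up rsid values.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by settings.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def rsids_in_settings(settings_xml: str) -> set:
    """Collect the values of all w:rsid elements in a settings.xml string."""
    root = ET.fromstring(settings_xml)
    return {el.get(f"{{{W}}}val") for el in root.iter(f"{{{W}}}rsid")}


def shared_rsids(settings_a: str, settings_b: str) -> set:
    """Return the rsid values that appear in both documents."""
    return rsids_in_settings(settings_a) & rsids_in_settings(settings_b)


def make_settings(values) -> str:
    """Build a minimal, hypothetical settings.xml for demonstration only."""
    items = "".join(f'<w:rsid w:val="{v}"/>' for v in values)
    return f'<w:settings xmlns:w="{W}"><w:rsids>{items}</w:rsids></w:settings>'


# Made-up values: the two documents have one session identifier in common.
doc_a = make_settings(["001E7189", "006D0D43"])
doc_b = make_settings(["001E7189", "00A1B2C3"])
print(shared_rsids(doc_a, doc_b))
```

A non-empty result would be a prompt for closer manual inspection of the two submissions rather than proof of collusion in itself.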