Data protection in software: bridging the gap between technical and non-technical specialists

With the advent of the Big Data era, people in the twenty-first century constantly share vast amounts of personal data on the Internet. This data is then routed through billions of digital services of all kinds, bringing immense convenience to our lives. However, in exchange for these benefits, we are compromising our data security and privacy rights. Privacy itself is difficult to define, as Solove observed in 2006 [1]: “Privacy is a concept in disarray. Nobody can articulate what it means. As one commentator has observed, privacy suffers from an embarrassment of meanings.” Compared to information security vulnerabilities, privacy and data protection issues are relatively subtle. However, as individuals become more aware of the importance of privacy protection and relevant laws and regulations (such as the EU's GDPR) are steadily strengthened, the question of how to improve privacy and data protection before and during software development is becoming an important research topic. In this blog article, we take a look at what software privacy concerns may exist and how software developers (technical) and lawyers (non-technical, e.g. Data Protection Officers, DPOs) can work together to investigate and improve software privacy compliance.

First, what should be checked for data privacy in software?

Personal data identification and protection are harder to pin down than standard software security issues. As a result, most current software privacy research applies machine learning, using a classifier to search for keywords and assess whether they represent personal data. This technique can achieve up to 90% accuracy on a balanced training data set; however, creating such a data set is challenging, and the remaining 10% that is mislabelled can have significant consequences. We therefore prefer to design software analysis algorithms in accordance with relevant data protection legal standards, such as the GDPR. By establishing a bridge between software developers and lawyers and making the results available to both parties, we can sort out the specifics of software privacy with respect to data protection and enable privacy compliance monitoring during the software development process.
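To make the classifier idea concrete, here is a minimal Python sketch, assuming scikit-learn is available. The training identifiers, labels, and model choice are all invented for illustration; a production system would need the large, carefully curated data set discussed above.

    # A minimal sketch of the keyword-classifier idea (assumes scikit-learn).
    # The examples, labels, and model choice are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labelled identifiers: do they denote personal data?
    fields = ["email_address", "home_address", "phone_number", "birth_date",
              "page_count", "retry_limit", "cache_size", "button_colour"]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = personal data, 0 = not

    # Character n-grams cope with the compound names common in source code.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    model.fit(fields, labels)

    # Classify previously unseen identifiers found while scanning code.
    print(model.predict(["user_email", "max_retries"]))  # likely [1 0]

Note how fragile this is: any identifier whose spelling does not resemble the training examples is silently misclassified, which is precisely the weakness that a legally grounded analysis aims to avoid.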

Automated software security vulnerability detection

Currently, there are three key approaches for automatically detecting software security vulnerabilities: dynamic testing, formal verification, and static analysis [2].

  • Dynamic testing methods find errors by actually executing the software; since only a limited number of test cases can be checked, they are not guaranteed to find all security vulnerabilities.
  • Formal verification methods include theorem proving and model checking, both of which can precisely determine the properties of software.
  • In recent years, static analysis has received increasing attention as an efficient class of program analysis methods. Given the abstract semantics of the language, these methods can automatically discover properties that hold over all possible (not necessarily actual) execution states. Static analysis has the advantage of being highly automated, fast, and able to examine infinite-state systems. Although it may produce some false negatives or false positives, it remains one of the most practical and effective security vulnerability detection methods available today; a toy illustration follows this list.
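The Python sketch below conveys the flavour of such an analysis: it inspects a program's syntax tree without ever running it and reports flows from user input into other calls. The miniature program, the choice of input() as the sensitive source, and the simple reporting are all assumptions made for illustration; real analysers track flows far more precisely.

    # A toy static analysis: walk the syntax tree of a program (without
    # executing it) and flag variables assigned from a sensitive source.
    import ast

    SOURCE = """
    name = input("Your name: ")   # sensitive user input
    count = 42
    log(name)
    """

    tainted = set()
    for node in ast.walk(ast.parse(SOURCE)):
        # Mark variables whose value comes directly from input().
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            if getattr(node.value.func, "id", "") == "input":
                tainted.update(t.id for t in node.targets
                               if isinstance(t, ast.Name))
        # Report any use of a tainted variable as a call argument.
        if isinstance(node, ast.Call):
            for arg in node.args:
                if isinstance(arg, ast.Name) and arg.id in tainted:
                    print(f"potential flow of user input via '{arg.id}'")

Because the program is never executed, the warning covers every possible run, including runs that may never occur in practice, which is exactly the over-approximation mentioned above.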

Software privacy: from security to GDPR compliance

It is difficult for developers and lawyers to reach a mutual understanding and benefit from each other's expertise and analyses. However, the GDPR can act as a mediator between them if we can transpose its principles and requirements onto the actual data flows in software. Here we look at fundamental GDPR obligations from the standpoint of a developer and then consider how we may design a solution that serves both parties in terms of privacy protection.

The GDPR places several obligations on the data controller, which should be monitored by the DPO:

  • to implement data protection by design and by default (article 25);
  • to maintain a record of processing activities (article 30);
  • to ensure the security of the processing (article 32);
  • to notify supervisory authorities of personal data breaches (article 33);
  • to communicate personal data breaches to the data subject (article 34);
  • to conduct a Data Protection Impact Assessment (DPIA) (article 35);
  • to conduct prior consultation with supervisory authorities (article 36).

The DPO's mission is to monitor whether the data controller fulfils all of these obligations, which includes performing a high-quality DPIA [3] when required. This makes producing the DPIA a duty shared by data controllers and DPOs. A DPIA is a technique that assists developers and organisations in systematically analysing, identifying, and mitigating data protection risks in a project or plan. The deliverable document created by the approach presented here is also referred to as a DPIA.

By giving detailed instructions on producing a successful DPIA, we aim to directly assist developers and DPOs. Using a static analysis technique from software security vulnerability detection, we can isolate the data flows in software that originate from or include sensitive user input. These data flows are then classified and summarised into an abstract form that DPOs can use to answer important DPIA questions.
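As a rough illustration of this abstraction step, the Python sketch below groups data flows by the kind of personal data involved. The flow records are hand-written stand-ins for what a static analyser would recover, and the category and sink names are hypothetical.

    # Hypothetical data flows, as a static analyser might report them.
    from dataclasses import dataclass

    @dataclass
    class DataFlow:
        source: str    # where the personal data enters, e.g. a form field
        sink: str      # where it ends up, e.g. a database or third party
        category: str  # kind of personal data involved

    flows = [
        DataFlow("signup_form.email", "users_db", "contact details"),
        DataFlow("signup_form.email", "analytics_api", "contact details"),
        DataFlow("profile_form.birthdate", "users_db", "date of birth"),
    ]

    # Summarise per category: which sinks receive which kinds of data?
    summary = {}
    for f in flows:
        summary.setdefault(f.category, set()).add(f.sink)

    for category, sinks in summary.items():
        print(f"{category}: sent to {', '.join(sorted(sinks))}")

Each summary line states, in a form a lawyer can read without opening the code, which recipients receive which categories of personal data, mapping directly onto DPIA questions about processing purposes and recipients.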

Bridging technical and non-technical privacy experts - further work

By adopting static analysis, we may use logic and rules to select the relevant privacy data flows in software so that DPOs and developers can evaluate privacy compliance. However, to achieve a more automated approach, we also aim to identify the most privacy-sensitive points in a program. For instance, if the accumulation or modification of personal data indicates a very high level of privacy sensitivity, we may apply a more precise local taint analysis to pinpoint these instances. In the meantime, millions of web and front-end applications written in scripting languages such as JavaScript cannot be analysed at the bytecode level. If we are able to create a large-scale source code analyser similar to Semgrep [4], then both developers and DPOs would benefit tremendously.
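To give an idea of what pinpointing such accumulation might look like, here is a deliberately naive Python sketch that scans JavaScript source text for pushes of personal fields into collections. The JavaScript snippet, the field list, and the regular-expression matching are all illustrative assumptions; a Semgrep-style analyser would match against parsed syntax rather than raw text.

    # Naive text scan for accumulation of personal data in JavaScript code.
    import re

    JS_SOURCE = """
    profiles.push({ email: user.email, age: user.age });
    counters.push(requestCount);
    """

    PERSONAL_FIELDS = {"email", "age", "address", "phone"}
    PUSH_CALL = re.compile(r"(\w+)\.push\((.*)\)")

    for line_no, line in enumerate(JS_SOURCE.splitlines(), start=1):
        match = PUSH_CALL.search(line)
        if match and any(f in match.group(2) for f in PERSONAL_FIELDS):
            print(f"line {line_no}: possible accumulation of personal data "
                  f"into '{match.group(1)}'")

Text matching like this is brittle, but it conveys the goal: surface the exact lines where personal data accumulates so that developers and DPOs can ask whether the accumulation is justified.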

References:

  1. Solove, Daniel J., A Taxonomy of Privacy. University of Pennsylvania Law Review, Vol. 154, No. 3, p. 477, January 2006, GWU Law School Public Law Research Paper No. 129, Available at SSRN: https://ssrn.com/abstract=667622
  2. Mohamed Almorsy, John Grundy, and Amani S. Ibrahim. 2012. Supporting automated vulnerability analysis using formalized vulnerability signatures. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012). Association for Computing Machinery, New York, NY, USA, 100–109. https://doi.org/10.1145/2351676.2351691
  3. Data Protection Impact Assessment (DPIA), https://gdpr.eu/data-protection-impact-assessment-template/
  4. Semgrep, https://semgrep.dev/

This blog post was written by Feiyang Tang. Feiyang received his MSc and BSc Honours degrees from KU Leuven and the University of Auckland, respectively. Since 2020, Feiyang has been a Marie Curie PhD fellow at the Norwegian Computing Center (NR) and the Norwegian University of Science and Technology (NTNU), with the objective of proposing a solution for organisations to automatically evaluate data protection compliance in their software products. His research interests include program analysis, information security, machine learning and data mining.

Feiyang Tang