Last update: July 2024

HealthDataAnonym: A Tool for the Anonymization of Health Data

Brief description

Useful health data are frequently recorded in text documents rather than as structured data. Applying data mining techniques to those documents to extract implicit knowledge can provide substantial benefits, but data analysts should usually not have access to sensitive data, so the documents should be anonymized first. HealthDataAnonym is a tool that facilitates the anonymization task and supports the application of different data mining tasks in order to evaluate the impact of the anonymization performed.

We developed the tool as a desktop application using the Java programming language (including the java.util.regex package, to match relevant regular expressions). We also used the natural language processing library spaCy, which is written in Python, mainly to perform a syntactic analysis of the sentences in the input documents. To integrate Java and Python, we used Jython. Besides anonymizing documents, the tool offers functionality to apply data mining techniques to them, in order to assess the impact of the anonymization. For the data mining techniques, we used the Weka library; for example, we used functionality provided by classes such as StringToWordVector (to transform an input document into a vector of keywords), Evaluation and Classifier, as well as different classes that implement the classification algorithms tested. We also used Balsamiq to design mockups of the application's windows and to review them with health professionals before implementing them. We tested the tool on both Linux and Windows.
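As an illustration of the regular-expression matching mentioned above, the following is a minimal sketch of how java.util.regex could be used to mask identifiers in free text. It is not the tool's actual implementation: the class name RegexAnonymizerSketch, the anonymize method, the three patterns and the placeholder labels ([DATE], [PHONE], [EMAIL]) are all hypothetical and chosen only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexAnonymizerSketch {

    // Hypothetical rules for illustration only; each placeholder label is
    // mapped to a pattern that detects one kind of identifier.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Dates written as d/m/yyyy or dd/mm/yyyy.
        RULES.put("[DATE]", Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"));
        // Nine-digit phone numbers grouped in threes by hyphens or spaces.
        RULES.put("[PHONE]", Pattern.compile("\\b\\d{3}[- ]\\d{3}[- ]\\d{3}\\b"));
        // Simple e-mail addresses (deliberately not a full RFC 5322 matcher).
        RULES.put("[EMAIL]", Pattern.compile("\\b[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+\\b"));
    }

    // Applies every rule in insertion order and replaces each match
    // with the corresponding placeholder label.
    public static String anonymize(String text) {
        String result = text;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            String label = Matcher.quoteReplacement(rule.getKey());
            result = rule.getValue().matcher(result).replaceAll(label);
        }
        return result;
    }

    public static void main(String[] args) {
        String note = "Patient seen on 12/03/2021, contact 600-123-456 or jdoe@example.org.";
        System.out.println(anonymize(note));
    }
}
```

Rules of this kind only cover identifiers with a predictable surface form. Names and other context-dependent mentions generally need linguistic analysis, which is presumably where a library such as spaCy is useful.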

Documentation

Software

Videos

Snapshots

Contributors

Researchers

Students (final degree projects)

Acknowledgments

Logos