# Read in file
from pathlib import Path
import pandas
import csv
data_path = Path('.') / "Data" # Replace with your own path to your data if applicable
df = pandas.read_csv(data_path / "Example_Data.csv")
# Replace the names with pseudonyms and save the new mapping
df['Username'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
name_mapping = df[['Name', 'Surname', 'Username']]  # avoid shadowing the built-in map()
name_mapping.to_csv(data_path / "mapping.csv", index = False)
# Redact the ZIP Code
df['ZIP_Code_redacted'] = df['ZIP Code'].apply(lambda x: str(x)[0] + 'xxxx')
# Shorten the Birth date to year and distort year
df['Birth_Date_distorted'] = df['Birth Date'].apply(lambda x: int(str(x)[-4:])+5)
# We can also round the number instead
df['Birth_Date_rounded'] = df['Birth Date'].apply(lambda x: round(int(str(x)[-4:]), -1) )
# Print in new file
df = df.drop(columns = ['Name', 'Surname'])
df = df[['Username', 'Birth_Date_rounded', 'Gender', 'ZIP_Code_redacted', 'Complaint']]
df.to_csv(data_path / "NewData.csv", index = False)
print(df)

Learn how to manage your data using research software
In this fifth session of the Programming Café, we look at the content of the Research Data Management Week and find out whether and how we can automate or simplify these processes using research software and tools.
What will we cover?
Data sharing
- De-identification of data
Collaborating
- Automatic creation of project structures
- Generating README files
- Check your file naming automatically
- Check whether your file format is preferred
- Have a look at CodeMeta
Other useful tools and software
- Quarto
- Software Management Plan
Preparation for the session
Before we (or you at home) dive into the content, make sure to open a suitable IDE (such as PyCharm or Visual Studio Code) or to install and open Jupyter Notebook (you can do that using Anaconda).
Data sharing – De-identification of data
In last Tuesday’s session, you learned how to share your data responsibly. You explored the de-identification of qualitative data and learned how to prepare your publication package.
Programming can assist you in de-identifying your data! Using research software, you can choose to either pseudonymize or anonymize it. Let’s take a closer look at how we can replace personal names, distort ages, and redact ZIP codes in this data:
In the following output Birth_Date_rounded has been shortened to BD and ZIP_Code_redacted to ZIP to allow for better readability on the website:
   Username    BD  Gender    ZIP        Complaint
0    User_1  1960    Male  2xxxx  Short of breath
1    User_2  1960    Male  2xxxx       Chest pain
2    User_3  1960  Female  2xxxx      Painful eye
3    User_4  1960  Female  2xxxx         Wheezing
4    User_5  1960  Female  2xxxx    Aching joints
5    User_6  1960  Female  2xxxx       Chest pain
6    User_7  1960    Male  2xxxx  Short of breath
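The Birth_Date_rounded column relies on Python's built-in round with a negative number of digits. For integers this rounds to the nearest multiple of ten and resolves exact ties to the even multiple, so not every year ending in 5 moves in the same direction. The following minimal sketch compares that behavior with an alternative that always maps a year to the start of its decade. The helper names round_year and decade_of are hypothetical and not part of the session's code.

```python
def round_year(year):
    # Same call as in the session's code: round to the nearest multiple of ten.
    # Exact ties (years ending in 5) go to the even multiple of ten.
    return round(year, -1)


def decade_of(year):
    # Alternative: always map a year to the first year of its decade.
    return year // 10 * 10


for year in (1964, 1965, 1966, 1975):
    print(year, round_year(year), decade_of(year))
# 1964 1960 1960
# 1965 1960 1960
# 1966 1970 1960
# 1975 1980 1970
```

Which variant fits better depends on how coarse the published birth year should be; the decade variant never moves a year forward.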
If you have saved a mapping, you can use it to pseudonymize textual data. In the following example, we use the generated usernames to replace the names in the text. If you want to follow along, copy your mapping.csv file and rename the copy to mapping_manual.csv. The code below reads the first column of each line as the text to replace and the second column as its replacement (for example, a full name followed by its username). You can then also add the following two lines manually:
Manchester,England
1965,the 60s

# Use your mapping file to change texts
# Read in the mapping
mapping = {}
with open(data_path / "mapping_manual.csv", mode = 'r', encoding = 'utf-8') as mapfile:
    reader = csv.reader(mapfile, delimiter=',')
    for line in reader:
        # first column: text to replace, second column: replacement
        mapping[line[0]] = line[1]
with open(data_path / 'text_example.txt', mode = 'r', encoding = 'utf-8') as txtfile:
    content = txtfile.read()
# Replace words in file with mapping file content
new_content, document = content, content
for key, value in mapping.items():
    new_content = new_content.replace(key, value)
    document = document.replace(key, f"{value} [Replaced - {key}]")
# Save two versions to also allow for documentation in the text
with open(data_path / 'output.txt', mode = 'w', encoding = 'utf-8') as outfile:
    print(new_content, file = outfile)
with open(data_path / 'documentation.txt', mode = 'w', encoding = 'utf-8') as docfile:
    print(document, file = docfile)

The documented version of the text then looks like this:
In this interview, we talk with User_1 [Replaced - Sean Curtis] about his wife User_4 [Replaced - Marion Aaker]. User_1 [Replaced - Sean Curtis] said that his wife has been wheezing since some weeks, while he mainly feels short of breath. Both born in the 60s [Replaced - 1965], Doctors may think it is related to their age and that were born in England [Replaced - Manchester].
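The replacement above uses str.replace, which also changes a key when it appears inside a longer word or number. If that is a concern for your texts, one possible refinement is to match whole words only with a regular expression. This is a minimal sketch under that assumption. The function name pseudonymize and the sample sentence are hypothetical and not part of the session's code.

```python
import re


def pseudonymize(text, mapping):
    # Try longer keys first so that a full name wins over a part of it.
    keys = sorted(mapping, key=len, reverse=True)
    # \b restricts matches to whole words; re.escape protects special characters.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single pass avoids replacing text that was itself inserted as a replacement.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


sample_mapping = {"Sean Curtis": "User_1", "1965": "the 60s"}
sample_text = "Sean Curtis was born in 1965; file 19650101 stays."
print(pseudonymize(sample_text, sample_mapping))
# User_1 was born in the 60s; file 19650101 stays.
```

Word boundaries work well for names and years, but keys that start or end with punctuation would need a different boundary rule.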
How to anonymize your data
To truly anonymize the data instead, I recommend using an open-source tool such as Textwash. Let’s take a quick look at how to use it:
First, download the module from GitHub. Then, open the terminal, navigate to the directory containing the downloaded files, and enter the following commands:
conda create -n textwash python=3.9
conda activate textwash
pip install -r requirements.txt
Do not forget to download the models and paste them into the data folder!
After preparing the folder, you can run the code below!
python3 anon.py --language en --input_dir my_documents --output_dir anonymised_documents --cpu
With this, the following text
In this interview, we talk with Sean Curtis about his wife Marion Aaker. Sean Curtis said that his wife has been wheezing since some weeks, while he mainly feels short of breath. Both born in 1965, Doctors may think it is related to their age and that were born in Manchester.
changes to this:
In this interview, we talk with PERSON_FIRSTNAME_1 PERSON_LASTNAME_1 about PRONOUN wife PERSON_FIRSTNAME_2 PERSON_LASTNAME_2. PERSON_FIRSTNAME_1 PERSON_LASTNAME_1 said that PRONOUN wife has been wheezing since some weeks, while PRONOUN mainly feels short of breath. Both born in DATE_1, Doctors DATE_1 think it is related to their age and that were born in LOCATION_1.
Collaborating - Create folder structure
In last Thursday’s session on collaboration, you learned more about structuring a research project. Did you know that Python can help you create a structured project folder?
The following code creates a folder structure as recommended by the good-enough-project:
from pathlib import Path
def create_folder_setup(project_name):
    # Create a project folder in documents
    project_path = Path.home() / 'Documents' / project_name
    project_path.mkdir(parents = True, exist_ok = True)
    # Create a list of folders that should be created
    folders = ['src', 'data', 'docs', 'results', 'config', 'bin', 'data/raw', 'data/processed', 'data/temp', 'docs/manuscript', 'docs/reports', 'results/figures', 'results/output']
    # Create folders (existing folders are left untouched)
    for folder in folders:
        (project_path / folder).mkdir(parents = True, exist_ok = True)
    # Create a list of files that should be created
    files = ['README.md', 'LICENSE.md', 'CITATION.md', '.gitignore', 'requirements.txt', 'docs/manuscript/notes.txt']
    # Create empty files, but only if they do not exist yet
    for file in files:
        if not (project_path / file).exists():
            (project_path / file).touch()
create_folder_setup("TestProject")

Although we’ve already created an empty README file, there are useful templates available to help you complete it. In this tutorial, you’ll find specific editors designed to assist with filling them in.
Collaborating - How to paste the folder and file structure into the README
One thing we can use Python for directly is inserting the folder and file structure into a README. For that, we can reuse content from another Programming Café session.
The goal is to create the following overview:
TestProject/
├──LICENSE.md
├──requirements.txt
├──README.md
├──.gitignore
├──CITATION.md
    ├──bin/
    ├──config/
    ├──docs/
        ├──manuscript/
        ├──notes.txt
        ├──reports/
    ├──results/
        ├──output/
        ├──figures/
    ├──data/
        ├──temp/
        ├──processed/
        ├──raw/
    ├──src/
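The code in this section relies on Path.walk, which is only available from Python 3.12 onwards. On older Python versions, the same overview can be produced with os.walk. The following is a minimal sketch under that assumption. The function name tree_lines is hypothetical, and entries are sorted here so the result is reproducible, which means the order can differ from the overview above.

```python
import os
from pathlib import Path


def tree_lines(project_path):
    project_path = Path(project_path)
    lines = []
    for dirpath, dirnames, filenames in os.walk(project_path):
        # Sorting in place makes os.walk visit subfolders in a fixed order.
        dirnames.sort()
        current = Path(dirpath)
        # Depth relative to the project root decides the indentation.
        depth = len(current.relative_to(project_path).parts)
        indent = "\t" * depth
        lines.append(f"{indent}├──{current.name}/" if depth else f"{current.name}/")
        for filename in sorted(filenames):
            lines.append(f"{indent}├──{filename}")
    return lines


if __name__ == "__main__":
    # Print the overview of the current working directory as a quick check.
    print("\n".join(tree_lines(Path.cwd())))
```

Returning the lines instead of writing them directly makes it easy to append the overview to an existing README rather than overwrite it.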
from pathlib import Path
def generate_README(project_path):
    # Open the README file and write the folder and file overview into it
    # (mode "w" overwrites any existing content of the README)
    with open(project_path / "README.md", mode = "w", encoding = "utf-8") as file_out:
        # for every directory/folder that we find... (Path.walk requires Python 3.12 or newer)
        for dirpath, dirnames, files in project_path.walk():
            # calculate needed indent by looking at the depth of the directory
            dir_length = len(dirpath.parts) - len(project_path.parts)
            indent = dir_length * '\t'
            # print indented directory name (the project folder itself is not indented)
            if dir_length > 0:
                print(f"{indent}├──{dirpath.name}/", file = file_out)
            else:
                print(f"{dirpath.name}/", file = file_out)
            #...iterate through files and print file information in the readme
            for file in files:
                print(f"{indent}├──{file}", file = file_out)
generate_README(Path.home() / 'Documents' / 'TestProject') # Replace with your own preferred path

Collaborating - How to check whether your file names fit your conventions
You can also use Python to verify whether your file names match the pattern you’ve chosen. You can create the pattern using one of these useful websites:
- https://regexr.com
- https://regex101.com
The example below checks file names against the pattern YYYYMMDD_ExperimentID_Task_Version.
The following file names fit this pattern:
- 19951231_exp1543_interview_v1.csv
- 20030128_exp1549_survey_v2.csv
If you want to know how to rename your files, have a look at the content of our previous Programming Café!
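Before running a pattern over a whole project, it can help to try it on a few names first. The following minimal sketch uses the same regular expression as this section's example; the function name is_consistent and the two failing file names are hypothetical.

```python
import re

# Same pattern as in this section: YYYYMMDD_expID_Task_Version
PATTERN = re.compile(
    r"^(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])_exp\d{4}_\w+_v\d+"
)


def is_consistent(file_name):
    # re.match anchors at the start only, so the file extension is not checked.
    return PATTERN.match(file_name) is not None


for name in (
    "19951231_exp1543_interview_v1.csv",
    "20030128_exp1549_survey_v2.csv",
    "19951331_exp1543_interview_v1.csv",  # month 13 does not exist
    "interview_final.csv",                # no date, ID, or version
):
    print(name, is_consistent(name))
# 19951231_exp1543_interview_v1.csv True
# 20030128_exp1549_survey_v2.csv True
# 19951331_exp1543_interview_v1.csv False
# interview_final.csv False
```

Note that the pattern checks month and day ranges separately, so an impossible date such as February 31 would still be accepted.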
from pathlib import Path
import csv
import re
def validate_filenames(project_path):
    # Define the pattern (the example matches YYYYMMDD_expID_Task_Version)
    # example file names: 19951231_exp1543_interview_v1.csv and 20030128_exp1549_survey_v2.csv
    pattern = r"^(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])_exp\d{4}_\w+_v\d+" # FILL IN YOUR FILE PATTERN HERE
    # Create file that will contain the validation results
    with open(project_path / 'validated_files.csv', 'w', newline='') as csvfile:
        validated = csv.writer(csvfile, delimiter='\t')
        # for every file in the project... (Path.walk requires Python 3.12 or newer)
        for dirpath, dirnames, files in project_path.walk():
            for file in files:
                # if it is a csv, txt or odt file...
                if file.endswith((".csv", ".txt", ".odt")):
                    # ...test against pattern and write the result
                    if re.match(pattern, file) is None:
                        validated.writerow([file,'tested','not consistent'])
                    else:
                        validated.writerow([file,'tested','consistent'])
                # else note that the file has not been tested
                else:
                    validated.writerow([file,'not tested','NA'])
validate_filenames(Path.home() / 'Documents' / 'TestProject') # Replace with your own preferred path

Collaborating - How to validate your file format
Using the same techniques as before, you can also use Python to check whether the file formats you’ve chosen are preferred. To do this, you’ll need to create a list of all preferred formats (you can base it on the DANS list of preferred file formats).
from pathlib import Path
def validate_filetype(project_path):
    # List of preferred file formats (adapt to your needs)
    valid = ('.odt', '.pdf', '.txt', '.xml', '.html', '.md', '.ods', '.csv', '.dat', '.sps', '.DO', '.R', '.siard', '.sql', '.py')
    # for every file in the project... (Path.walk requires Python 3.12 or newer)
    for dirpath, dirnames, files in project_path.walk():
        for file in files:
            # ...check whether file extension is NOT in list of preferred formats...
            if not file.endswith(valid):
                #... and print warning
                print(f"{file} is not in a preferred file format")
validate_filetype(Path.home() / 'Documents' / 'TestProject')

Other useful tools and links
- Quarto: An open-source tool that helps you publish your code.
- OpenRefine: An open-source tool that helps you clean your data without knowing how to program. Find an online tutorial here.
- Software Management Plan: This decision tree tool helps you to fill in your own software management plan. While it is not mandatory for EUR researchers, I can only recommend filling it in.
- Use GitHub to collaborate with others on your code
- CodeMeta: A metadata standard for your software
- How to FAIRify your Research Software: Helpful tips to make your Research Software FAIR
- Assess your own software in terms of FAIR
- Recommendations for FAIR software
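To give an impression of what CodeMeta metadata looks like, here is a minimal sketch that builds a small record as JSON. It assumes the CodeMeta 2.0 context and uses only a handful of its terms. The function name build_codemeta, the repository URL, and the author are hypothetical placeholders, so replace them with your own details and check the CodeMeta documentation for the full set of terms.

```python
import json


def build_codemeta(name, description, version, repository, given_name, family_name):
    # A small subset of CodeMeta terms; many more are defined by the standard.
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "version": version,
        "codeRepository": repository,
        "programmingLanguage": "Python",
        "author": [
            {"@type": "Person", "givenName": given_name, "familyName": family_name}
        ],
    }


# Placeholder values for illustration only.
metadata = build_codemeta(
    name="TestProject",
    description="Scripts for de-identifying and organizing research data.",
    version="0.1.0",
    repository="https://example.org/TestProject",
    given_name="Jane",
    family_name="Doe",
)
codemeta_json = json.dumps(metadata, indent=2)
print(codemeta_json)
```

By convention, such a record is saved as codemeta.json in the root folder of the project.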