Learn how to manage your data using research software
In this fifth session of the Programming Café, we will look back at the content of the Research Data Management Week and find out whether and how we can automate or simplify these processes using research software and tools.
What will we cover?
Data sharing
De-identification of data
Collaborating
Automatic creation of project structures
Generating README files
Checking your file naming automatically
Checking whether your file format is preferred
A look at CodeMeta
Other useful tools and software
Quarto
Software Management Plan
Preparation for the session
Before we (or you at home) start diving into the content, make sure to open a suitable IDE (such as PyCharm or Visual Studio Code) or to install and open Jupyter Notebook (you can do that using Anaconda).
EUR internals: Find PyCharm here and Visual Studio Code here
Self-managed PC: Find PyCharm here and Visual Studio Code here
Data sharing – De-identification of data
In last Tuesday’s session, you learned how to share your data responsibly. You explored the de-identification of qualitative data and learned how to prepare your publication package.
Programming can assist you in de-identifying your data! Using research software, you can choose to either pseudonymize or anonymize it. Let’s take a closer look at how we can replace personal names, distort ages, and redact ZIP codes in this data:
```python
# Read in file
from pathlib import Path
import pandas
import csv

data_path = Path('.') / "Data"  # Replace with your own path to your data if applicable
df = pandas.read_csv(data_path / "Example_Data.csv")

# Replace the names with pseudonyms and save the new mapping
df['Username'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
mapping_df = df[['Name', 'Surname', 'Username']]
mapping_df.to_csv(data_path / "mapping.csv", index=False)

# Redact the ZIP code
df['ZIP_Code_redacted'] = df['ZIP Code'].apply(lambda x: str(x)[0] + 'xxxx')

# Shorten the birth date to the year and distort the year
df['Birth_Date_distorted'] = df['Birth Date'].apply(lambda x: int(str(x)[-4:]) + 5)

# We can also round the year instead
df['Birth_Date_rounded'] = df['Birth Date'].apply(lambda x: round(int(str(x)[-4:]), -1))

# Print to a new file
df = df.drop(columns=['Name', 'Surname'])
df = df[['Username', 'Birth_Date_rounded', 'Gender', 'ZIP_Code_redacted', 'Complaint']]
df.to_csv(data_path / "NewData.csv", index=False)
print(df)
```
Username Birth_Date_rounded Gender ZIP_Code_redacted Complaint
0 User_1 1960 Male 2xxxx Short of breath
1 User_2 1960 Male 2xxxx Chest pain
2 User_3 1960 Female 2xxxx Painful eye
3 User_4 1960 Female 2xxxx Wheezing
4 User_5 1960 Female 2xxxx Aching joints
5 User_6 1960 Female 2xxxx Chest pain
6 User_7 1960 Male 2xxxx Short of breath
If you have saved a specific mapping, you can use it to pseudonymize textual data. In the following example, we use the generated usernames to replace the names in the text. If you want to follow along, just copy your mapping.csv file and rename the copy to mapping_manual.csv. You can then also add the following two lines (manually):
Manchester,England
1965,the 60s
```python
# Use your mapping file to change texts
# Read in the mapping
mapping = {}
with open(data_path / "mapping_manual.csv", mode='r', encoding='utf-8') as mapfile:
    reader = csv.reader(mapfile, delimiter=',')
    for line in reader:
        mapping[line[0]] = line[1]

with open(data_path / 'text_example.txt', mode='r', encoding='utf-8') as txtfile:
    content = txtfile.read()

# Replace words in the file with the mapping file content
new_content, document = content, content
for key, value in mapping.items():
    new_content = new_content.replace(key, value)
    document = document.replace(key, f"{value} [Replaced - {key}]")

# Save two versions to also allow for documentation in the text
with open(data_path / 'output.txt', mode='w', encoding='utf-8') as outfile:
    print(new_content, file=outfile)
with open(data_path / 'documentation.txt', mode='w', encoding='utf-8') as docfile:
    print(document, file=docfile)
```
Afterwards, the documented version of the text looks like this:
In this interview, we talk with User_1 [Replaced - Sean Curtis] about his wife User_4 [Replaced - Marion Aaker]. User_1 [Replaced - Sean Curtis] said that his wife has been wheezing since some weeks, while he mainly feels short of breath. Both born in the 60s [Replaced - 1965], Doctors may think it is related to their age and that were born in England [Replaced - Manchester].
How to anonymize your data
To truly anonymize the data instead, I recommend using an open-source tool such as Textwash. Let’s take a quick look at how to use it:
First, download the module from GitHub. Then open a terminal, navigate to the directory containing the downloaded files, and run the setup commands to install its dependencies.
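The exact setup commands depend on the Textwash repository itself; as a sketch (the repository URL and the presence of a requirements.txt are assumptions here, so check the project's README for the authoritative steps), the preparation might look like this:

```shell
# Clone the Textwash repository (URL assumed) and enter it
git clone https://github.com/maximilianmozes/textwash.git
cd textwash

# Install the Python dependencies (assuming the repository ships a requirements.txt)
pip install -r requirements.txt
```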
After preparing the folder, you can run the code below!
python3 anon.py --language en --input_dir my_documents --output_dir anonymised_documents --cpu
With this, the following text
In this interview, we talk with Sean Curtis about his wife Marion Aaker. Sean Curtis said that his wife has been wheezing since some weeks, while he mainly feels short of breath. Both born in 1965, Doctors may think it is related to their age and that were born in Manchester.
changes to this:
In this interview, we talk with PERSON_FIRSTNAME_1 PERSON_LASTNAME_1 about PRONOUN wife PERSON_FIRSTNAME_2 PERSON_LASTNAME_2. PERSON_FIRSTNAME_1 PERSON_LASTNAME_1 said that PRONOUN wife has been wheezing since some weeks, while PRONOUN mainly feels short of breath. Both born in DATE_1, Doctors DATE_1 think it is related to their age and that were born in LOCATION_1.
Collaborating - Create folder structure
In last Thursday’s session on collaboration, you learned more about structuring a research project. Did you know that Python can help you create a structured project folder?
The following code creates a folder structure as recommended by the good-enough-project:
```python
from pathlib import Path

def create_folder_setup(project_name):
    # Create a project folder in Documents
    project_path = Path.home() / 'Documents' / project_name
    if not project_path.exists():
        project_path.mkdir()

    # Create a list of folders that should be created
    folders = ['src', 'data', 'docs', 'results', 'config', 'bin',
               'data/raw', 'data/processed', 'data/temp',
               'docs/manuscript', 'docs/reports',
               'results/figures', 'results/output']

    # Create folders
    for folder in folders:
        if not (project_path / folder).exists():
            (project_path / folder).mkdir()

    # Create a list of files that should be created
    files = ['README.md', 'LICENSE.md', 'CITATION.md', '.gitignore',
             'requirements.txt', 'docs/manuscript/notes.txt']

    # Create files
    for file in files:
        if not (project_path / file).exists():
            (project_path / file).touch()

create_folder_setup("TestProject")
```
Although we’ve already created an empty README file, there are useful templates available to help you complete it. In this tutorial, you’ll find specific editors designed to assist with filling them in.
Collaborating - How to paste the folder and file structure into the README
What we might want to use Python for directly is inserting file paths into a README. For that, we can reuse content from another Programming Café session.
The goal is to create the following overview:

```
TestProject/
├──LICENSE.md
├──requirements.txt
├──README.md
├──.gitignore
├──CITATION.md
├──bin/
├──config/
├──docs/
	├──manuscript/
		├──notes.txt
	├──reports/
├──results/
	├──output/
	├──figures/
├──data/
	├──temp/
	├──processed/
	├──raw/
├──src/
```
```python
from pathlib import Path

def generate_README(project_path):
    # Open the README file and print all information about the files into it
    with open(project_path / "README.md", mode="w", encoding="utf-8") as file_out:
        # for every directory/folder that we find... (Path.walk requires Python 3.12+)
        for dirpath, dirnames, files in Path.walk(project_path):
            # calculate the needed indent by looking at the depth of the directory
            dir_length = len(dirpath.parts) - len(project_path.parts)
            indent = dir_length * '\t'
            # print the indented directory name
            if dir_length > 0:
                print(f"{indent}├──{dirpath.name}/", file=file_out)
            else:
                print(f"{dirpath.name}/", file=file_out)
            # ...iterate through the files and print file information into the README
            for file in files:
                print(f"{indent}├──{file}", file=file_out)

generate_README(Path.home() / 'Documents' / 'TestProject')  # Replace with your own preferred path
```
Collaborating - How to check whether your file names fit your conventions
You can also use Python to verify whether your file names match the pattern you've chosen. You can create the pattern using one of these useful websites:
- https://regexr.com
- https://regex101.com

The example below checks file names against the pattern YYYYMMDD_ExperimentID_Task_Version.

It fits the following file names:
- 19951231_exp1543_interview_v1.csv
- 20030128_exp1549_survey_v2.csv
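Before wiring the pattern into a full validation script, you can sanity-check it interactively. A minimal sketch, using the pattern and the example file names above (the third name is a made-up counter-example):

```python
import re

# Pattern for YYYYMMDD_ExperimentID_Task_Version
pattern = r"^(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])_exp\d{4}_\w+_v\d+"

examples = [
    "19951231_exp1543_interview_v1.csv",  # fits the pattern
    "20030128_exp1549_survey_v2.csv",     # fits the pattern
    "interview_exp1543_19951231.csv",     # wrong order -> does not fit
]

for name in examples:
    status = "consistent" if re.match(pattern, name) else "not consistent"
    print(f"{name}: {status}")
```

Testing a handful of good and bad names like this quickly reveals whether the date and version parts of the pattern are as strict as you intend.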
If you want to know how to rename your files, have a look at the content of our previous programming café!
```python
from pathlib import Path
import csv
import re

def validate_filenames(project_path):
    # Create the file that will contain the files that have been validated
    with open(project_path / 'validated_files.csv', 'w', newline='') as csvfile:
        validated = csv.writer(csvfile, delimiter='\t')
        # for every file in the project... (Path.walk requires Python 3.12+)
        for dirpath, dirnames, files in Path.walk(project_path):
            for file in files:
                # if it is a text-based data file...
                if file.endswith((".csv", ".txt", ".odt")):
                    # ...define the pattern... (the example matches YYYYMMDD_expID_Task_Version)
                    # example file names: 19951231_exp1543_interview_v1.csv and 20030128_exp1549_survey_v2.csv
                    pattern = r"^(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])_exp\d{4}_\w+_v\d+"  # FILL IN YOUR FILE PATTERN HERE
                    # ...and test against the pattern and write the results
                    if re.match(pattern, file) is None:
                        validated.writerow([file, 'tested', 'not consistent'])
                    else:
                        validated.writerow([file, 'tested', 'consistent'])
                # else record that the file has not been tested
                else:
                    validated.writerow([file, 'not tested', 'NA'])

validate_filenames(Path.home() / 'Documents' / 'TestProject')  # Replace with your own preferred path
```
Collaborating - How to validate your file format
Using the same techniques as before, you can also use Python to check whether the file formats you’ve chosen are preferred. To do this, you’ll need to create a list of all preferred formats (you can base it on the DANS list of preferred file formats here).
```python
from pathlib import Path

def validate_filetype(project_path):
    # List of preferred formats (based on the DANS list of preferred file formats)
    valid = ['.odt', '.pdf', '.txt', '.xml', '.html', '.md', '.ods', '.csv',
             '.dat', '.sps', '.DO', '.R', '.siard', '.sql', '.py']
    # for every file in the project... (Path.walk requires Python 3.12+)
    for dirpath, dirnames, files in Path.walk(project_path):
        for file in files:
            # ...check whether the file extension is NOT in the list of preferred formats...
            if not any(file.endswith(ext) for ext in valid):
                # ...and print a warning
                print(f"{file} is not in a preferred file format")

validate_filetype(Path.home() / 'Documents' / 'TestProject')
```
Other useful tools and links
Quarto: An open-source tool that helps you publish your code.
OpenRefine: An open-source tool that helps you clean your data without knowing how to program. Find an online tutorial here.
Software Management Plan: This decision tree tool helps you fill in your own software management plan. While it is not mandatory for EUR researchers, I can only recommend filling it in.
Use GitHub to collaborate with others on your code