Lesson script - Programming Café on 9 Dec
In this second session of the Programming Café, we’ll dive into how to use Python scripts to automate repetitive tasks in your research. We’ll explore the basics of automation, focusing on how to effectively move, rename and copy your files and write clear, functional scripts.
We will start with a question: Which tasks would you like to automate in your research? I hope to show you some tricks for these tasks today.
Requirements to start
Did everyone install Python? Who needs help? I recommend using PyCharm!
- Please create a copy of the files that you want to work on.
- You can also download test files extracted from the German Summary Corpus (Wedig & Strobl, 2024) here.
- Create a folder “data” in your project and a subfolder “files” into which you paste your files.
Import the necessary modules and create a main function to coordinate all tasks
import os  # module that gives us access to the operating system
import shutil  # module that gives us high-level file operations (copy, move)
import re  # module that gives us access to regular expressions
import time  # module that gives us access to time functions
if __name__ == '__main__':
    pass  # placeholder; will be filled with our functions
File Management
During a busy period like a PhD, tasks such as tracking existing files or creating backups may be overlooked. Python can help automate these tasks, allowing you to complete them with the press of a button.
Generate a file structure automatically
At the beginning of your project (or at a later stage), you might wonder how best to structure your files. This code snippet automates the process. Be aware: the folder data/copy is created only for today's session and is not typical of a research project.
def create_folders():
    # Here you can define how you would like to name your folders
    folders = ["data/raw", "data/processed", "data/copy", "scripts", "results"]
    # Create folders (exist_ok=True means: no error if a folder already exists)
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
Get an overview of all files in one folder
To get an overview of all the files in a folder, you can use the listdir() function.
def list_files():
    # Get the directory path
    working_dir = os.getcwd()
    path = os.path.join(working_dir, "data", "files")
    print(os.listdir(path))  # print allows us to see the results on the console.
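os.listdir() returns files and subfolders alike, in no particular order. As a minimal sketch of how the overview could be narrowed down, the following helper keeps only regular files with a given ending and sorts them. The name list_files_with_suffix and its parameters are hypothetical and not part of the session's script.

```python
import os


def list_files_with_suffix(folder, suffix=".csv"):
    """Return the sorted names of all regular files in `folder` that end with `suffix`."""
    names = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # skip subfolders and keep only names with the requested ending
        if os.path.isfile(full_path) and name.endswith(suffix):
            names.append(name)
    return sorted(names)
```

It could be called as list_files_with_suffix(os.path.join("data", "files"), ".txt"), assuming the folder from the requirements exists.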
Copying files: e.g., to create a backup
You might find yourself in a situation where you want to create copies of your files, such as for backup purposes. For this, you can use the shutil package.
def copy_files():
    # Get the directory path
    working_dir = os.getcwd()
    path = os.path.join(working_dir, "data", "files")
    # this is how we copy one file
    source = os.path.join(path, "L1_Ki_02.csv")
    destination = os.path.join(working_dir, "data", "test", "copy_L1_Ki_02.csv")
    # copyfile does not create missing folders, so make sure data/test exists
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.copyfile(source, destination)
    # we can also copy a whole directory (dirs_exist_ok needs Python 3.8 or newer)
    shutil.copytree(os.path.join(working_dir, "data", "files"), os.path.join(working_dir, "data", "copy"), dirs_exist_ok=True)
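For regular backups, it can help to keep each copy in its own dated folder instead of overwriting the previous one. The following is only a sketch of that idea; the function backup_folder, its parameters and the folder naming scheme are assumptions for illustration, not part of the session's script.

```python
import os
import shutil
import time


def backup_folder(source, backup_root):
    """Copy `source` into a new, time-stamped subfolder of `backup_root` and return its path."""
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S")  # local time, e.g. 2024-12-09_14-30-00
    destination = os.path.join(backup_root, "backup_" + stamp)
    # copytree creates the destination (and missing parent folders) itself
    shutil.copytree(source, destination)
    return destination
```

Because the time stamp has a resolution of one second, two calls within the same second would target the same folder and the second call would raise an error.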
Renaming files
After copying the files, we can perform some small operations on them!
Have you ever found yourself in a situation where you realized you didn’t follow a clear concept in your file naming or ordered the sequences within the file names incorrectly? Let’s fix that!
Let’s start simple: add the prefix “exp_” in front of the current file name.
def change_name():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # for every file in that folder, we want to create a new file name and then use rename to change the name
    for file in os.listdir(folder_path):
        new_name = "exp_" + file
        os.rename(os.path.join(folder_path, file), os.path.join(folder_path, new_name))
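Renaming cannot be undone with a single click, so it may be worth looking at the planned names before anything changes. The sketch below splits the task into a planning step and an applying step. Both function names are hypothetical, and skipping names that already carry the prefix is an added assumption rather than the behaviour of change_name().

```python
import os


def plan_renames(folder, prefix):
    """Return (old_name, new_name) pairs without touching any file."""
    plan = []
    for name in sorted(os.listdir(folder)):
        if not name.startswith(prefix):  # avoid adding the prefix twice
            plan.append((name, prefix + name))
    return plan


def apply_renames(folder, plan):
    """Carry out the renames listed in `plan`."""
    for old_name, new_name in plan:
        os.rename(os.path.join(folder, old_name), os.path.join(folder, new_name))
```

A possible workflow is to print the result of plan_renames(folder, "exp_") first and to call apply_renames() only once the list looks right.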
Now a bit more complex: let’s rename only the csv files.
def change_name_csv():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # for every file, we want to change the name, but only if the file ends with ".csv"
    for file in os.listdir(folder_path):
        if file.endswith(".csv"):
            new_name = "csv_" + file
            os.rename(os.path.join(folder_path, file), os.path.join(folder_path, new_name))
We can also make it a bit more advanced and add the string “csv” only where the name (without its “.csv” ending) does not contain “csv” already.
def change_name_csv_advanced():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # for every file, we want to change the name, but only if the file ends with ".csv"
    # and the part before the ending does not include "csv" already
    for file in os.listdir(folder_path):
        name_without_ending = os.path.splitext(file)[0]
        if file.endswith(".csv") and "csv" not in name_without_ending:
            new_name = "csv_" + file
            os.rename(os.path.join(folder_path, file), os.path.join(folder_path, new_name))
We can also adjust names in more detail using regular expressions. Here, every run of whitespace in a file name is replaced with an underscore.
def change_name_re():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # Create a pattern (\s means whitespace, + means one or more)
    pattern = re.compile(r"(\s+)")
    # and define what you want to replace it with
    replace_with = "_"
    # for every file, replace all matches of the pattern and rename the file if the name changed
    for file in os.listdir(folder_path):
        new_name = pattern.sub(replace_with, file)
        if new_name != file:
            os.rename(os.path.join(folder_path, file), os.path.join(folder_path, new_name))
Regular expressions are not limited to spaces; you can adapt them to match your file names. A good tool for building and testing them is regexr, which also provides a handy cheat sheet.
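Regular expressions can also reorder the parts of a file name, which addresses the incorrectly ordered sequences mentioned earlier. The sketch below uses capturing groups for this. It assumes that names follow the structure of the test file “L1_Ki_02.csv”, read here as group, label and number; that reading, the pattern and the function name reorder_name are assumptions for illustration.

```python
import re

# Assumed name structure: <group>_<label>_<number>.<ending>, e.g. "L1_Ki_02.csv"
NAME_PATTERN = re.compile(r"^(L\d+)_([A-Za-z]+)_(\d+)(\.\w+)$")


def reorder_name(name):
    """Move the number to the front; names that do not match are returned unchanged."""
    # \1 to \4 refer to the four groups in parentheses above
    return NAME_PATTERN.sub(r"\3_\1_\2\4", name)
```

With this pattern, reorder_name("L1_Ki_02.csv") gives "02_L1_Ki.csv". The function only computes the new name; the actual renaming would still be done with os.rename() as in the snippets above.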
Moving files based on criteria - Sort your files
Another task that you may be interested in is sorting files. To sort files, we can use shutil again. Files can be sorted according to various criteria; for now, I will show you sorting by file type and by file name.
Sort all your files based on file type
To sort files by file type, we can use the .endswith() function that we already used previously.
def sort_files_type():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # Decide on the names of the new directories (can also be nicer, but quick and dirty now)
    destination_csv = os.path.join(working_dir, "data", "csv_files")
    destination_txt = os.path.join(working_dir, "data", "txt_files")
    # Create the directories
    os.makedirs(destination_csv, exist_ok=True)
    os.makedirs(destination_txt, exist_ok=True)
    # for every file in the directory, sort in the right new folder
    for file in os.listdir(folder_path):
        if file.endswith(".csv"):
            shutil.move(os.path.join(folder_path, file), os.path.join(destination_csv, file))
        elif file.endswith(".txt"):
            shutil.move(os.path.join(folder_path, file), os.path.join(destination_txt, file))
Sort your files based on file name
To sort the files by name or certain characteristics of the names, we use the “in” operator.
def sort_file_name():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data", "copy")
    # Decide on the names of the new directories (can also be nicer, but quick and dirty now)
    destination_L1 = os.path.join(working_dir, "data", "L1")
    destination_L2 = os.path.join(working_dir, "data", "L2")
    os.makedirs(destination_L1, exist_ok=True)
    os.makedirs(destination_L2, exist_ok=True)
    # for every file in the directory, sort in the right new folder using "in"
    for file in os.listdir(folder_path):
        if "L1" in file:
            shutil.move(os.path.join(folder_path, file), os.path.join(destination_L1, file))
        elif "L2" in file:
            shutil.move(os.path.join(folder_path, file), os.path.join(destination_L2, file))
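With more than two categories, writing one if/elif branch per folder becomes tedious. As a sketch of a more general variant, the function below takes a list of keywords and moves each file into a subfolder named after the first keyword found in its name. The function name sort_by_keyword is hypothetical, and placing the subfolders inside the sorted folder itself (rather than next to it, as above) is an assumption made to keep the example short.

```python
import os
import shutil


def sort_by_keyword(folder, keywords):
    """Move each file in `folder` into a subfolder named after the first keyword in its name."""
    moved = {}
    for name in sorted(os.listdir(folder)):
        source = os.path.join(folder, name)
        if not os.path.isfile(source):
            continue  # leave existing subfolders alone
        for keyword in keywords:
            if keyword in name:
                target_dir = os.path.join(folder, keyword)
                os.makedirs(target_dir, exist_ok=True)
                shutil.move(source, os.path.join(target_dir, name))
                moved[name] = keyword
                break  # the first matching keyword wins
    return moved
```

Calling sort_by_keyword(folder_path, ["L1", "L2"]) would then cover the two cases from sort_file_name(); files without any keyword stay where they are, and the returned dictionary records which file went where.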
Documentation
For a research project, you might also want a list of all files along with their modification dates. We can use Python for that!
Generate a README with all file names and their last modification date
def generate_README():
    # Get the directory path
    working_dir = os.getcwd()
    folder_path = os.path.join(working_dir, "data")
    # Open the README file, print the header and then all information about the files
    with open("README.txt", mode="w", encoding="utf-8") as file_out:
        print("directory", "subdirectory", "file", "modification_date", sep="\t", file=file_out)
        # for every directory/folder that we find...
        for dirpath, dirnames, files in os.walk(folder_path):
            # ...iterate through files and print information in the readme
            for file in files:
                path = os.path.join(dirpath, file)
                mod_time = time.ctime(os.path.getmtime(path))
                print(os.path.basename(folder_path), os.path.basename(dirpath), file, mod_time, sep="\t", file=file_out)
In the end, our main function may look like this:
if __name__ == '__main__':
    create_folders()
    list_files()
    copy_files()
    change_name()
    change_name_csv()
    change_name_csv_advanced()
    sort_file_name()
    sort_files_type()
    generate_README()
I hope that this already helps you a bit for your research projects!
Any other task that you want to optimize/automate?
Here is a nice resource with even more topics!
Extra information pathlib:
You can also use the pathlib module, which treats paths as objects, to make building and searching paths easier.
For that, add the following import:
from pathlib import Path
After that, you can rewrite the code snippets above:
def create_folders():
    # Here you can define how you would like to name your folders
    folders = ["data/raw", "data/processed", "data/copy", "scripts", "results"]
    # Create folders (parents=True also creates "data" if it is missing)
    for folder in folders:
        Path(folder).mkdir(parents=True, exist_ok=True)
def list_files():
    # Get the directory path (the / operator joins path parts)
    path = Path.cwd() / "data" / "files"
    # iterdir() yields Path objects; .name is the file name without the folders
    print([entry.name for entry in path.iterdir()])  # print allows us to see the results on the console.
def change_name():
    # Get the directory path
    folder_path = Path.cwd() / "data" / "copy"
    # for every file in that folder, we want to create a new file name and then use rename to change the name
    # (list() takes a snapshot of the folder before we start renaming)
    for file in list(folder_path.iterdir()):
        new_name = "exp_" + file.name
        file.rename(file.with_name(new_name))
def change_name_csv():
    # Get the directory path
    folder_path = Path.cwd() / "data" / "copy"
    # for every file, we want to change the name, but only if the file ending (suffix) is ".csv"
    for file in list(folder_path.iterdir()):
        if file.suffix == ".csv":
            new_name = "csv_" + file.name
            file.rename(file.with_name(new_name))
def change_name_csv_advanced():
    # Get the directory path
    folder_path = Path.cwd() / "data" / "copy"
    # for every file, we want to change the name, but only if the file ends with ".csv"
    # and the name without the ending (stem) does not include "csv" already
    for file in list(folder_path.iterdir()):
        if file.suffix == ".csv" and "csv" not in file.stem:
            new_name = "csv_" + file.name
            file.rename(file.with_name(new_name))
def sort_files_type():
    # Get the directory path
    folder_path = Path.cwd() / "data" / "copy"
    # Decide on the names of the new directories (can also be nicer, but quick and dirty now)
    destination_csv = Path.cwd() / "data" / "csv_files"
    destination_txt = Path.cwd() / "data" / "txt_files"
    # Create the directories
    destination_csv.mkdir(parents=True, exist_ok=True)
    destination_txt.mkdir(parents=True, exist_ok=True)
    # for every file in the directory, sort in the right new folder
    for file in list(folder_path.iterdir()):
        if file.suffix == ".csv":
            shutil.move(str(file), str(destination_csv / file.name))
        elif file.suffix == ".txt":
            shutil.move(str(file), str(destination_txt / file.name))
def sort_file_name():
    # Get the directory path
    folder_path = Path.cwd() / "data" / "copy"
    # Decide on the names of the new directories (can also be nicer, but quick and dirty now)
    destination_L1 = Path.cwd() / "data" / "L1"
    destination_L2 = Path.cwd() / "data" / "L2"
    destination_L1.mkdir(parents=True, exist_ok=True)
    destination_L2.mkdir(parents=True, exist_ok=True)
    # for every file in the directory, sort in the right new folder using "in"
    for file in list(folder_path.iterdir()):
        if "L1" in file.name:
            shutil.move(str(file), str(destination_L1 / file.name))
        elif "L2" in file.name:
            shutil.move(str(file), str(destination_L2 / file.name))
def generate_README():
    # Get the directory path
    folder_path = Path.cwd() / "data"
    # Open the README file, print the header and then all information about the files
    with open("README.txt", mode="w", encoding="utf-8") as file_out:
        print("directory", "subdirectory", "file", "modification_date", sep="\t", file=file_out)
        # rglob("*") finds everything in the folder and all of its subfolders...
        for path in sorted(folder_path.rglob("*")):
            # ...but we only want to list files, not the folders themselves
            if path.is_file():
                mod_time = time.ctime(path.stat().st_mtime)
                print(folder_path.name, path.parent.name, path.name, mod_time, sep="\t", file=file_out)