[CS] Automating FTP folder structure updates with GitHub Actions
In this post, I explain how I used GitHub Actions to automate a recurring task: regularly fetching the file and folder structure of the ENSEMBL Beta FTP server. I needed the up-to-date structure so I could count the number of species (represented by the top-level folder names) and, for each species, the number of genomes (folders with names starting with “GCA_” or “GCF_”). The workflow writes the counts to tab-separated tables saved as text files. Below is a step-by-step guide that walks you through the process.
Reference: GitHub - snu-cdrc/gencube/.github/workflows
1. Create a GitHub Repository
- Go to GitHub and click “New repository”.
- Give your repository a name (e.g., ensembl-beta-ftp-structure) and choose whether to make it public or private.
- Click “Create repository”.
2. Clone the Repository Locally
If you’re comfortable working with Git locally, you can clone the repository using your preferred method (e.g., via the command line):
git clone https://github.com/yourusername/ensembl-beta-ftp-structure.git
cd ensembl-beta-ftp-structure
3. Set Up the GitHub Actions Directory
GitHub Actions workflows must reside in a specific folder. Create the following folder structure in the root of your repository:
ensembl-beta-ftp-structure/
└── .github/
    └── workflows/
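If you would rather script this step, the folder structure above can be created with a few lines of Python. This is only a minimal sketch: the helper name create_workflow_dir is hypothetical, and the demo runs in a temporary directory so it does not touch a real checkout.

```python
from pathlib import Path
import tempfile


def create_workflow_dir(repo_root):
    """Create <repo_root>/.github/workflows if it is missing and return its path."""
    workflows = Path(repo_root) / ".github" / "workflows"
    # parents=True also creates .github; exist_ok=True makes reruns harmless.
    workflows.mkdir(parents=True, exist_ok=True)
    return workflows


# Demo in a throwaway directory (an assumption for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    created = create_workflow_dir(tmp)
    relative = created.relative_to(tmp).as_posix()
    print(relative)
```

In a real repository you would pass the path of your local clone instead of the temporary directory.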
4. Create the Workflow File
Inside the workflows folder, create a new file called ensembl-beta-ftp-structure.yml. Open the file in your favorite text editor and paste the following YAML content.
name: Update FTP Folder Structure
on:
  schedule:
    # Runs once a day at 15:00 (UTC)
    - cron: '0 15 * * *'
  workflow_dispatch:
jobs:
  update_structure:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          persist-credentials: false
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y lftp
          pip install --upgrade pip
      - name: Fetch FTP folder structure using ls -R
        env:
          FTP_SERVER: "ftp.ebi.ac.uk"        # Replace with your FTP server if needed
          FTP_PATH: "/pub/ensemblorganisms"  # Replace with desired FTP path
        run: |
          lftp -c "open $FTP_SERVER; cd $FTP_PATH; ls -R" > ensembl-beta_ftp_structure.txt
      - name: Convert structure to JSON
        run: |
          python3 << 'EOF'
          import json
          def parse_ls_output(file_path):
              tree = {}
              current_path = []  # Current directory path (list)
              with open(file_path, "r") as f:
                  for line in f:
                      line = line.rstrip("\n")
                      if not line:
                          continue
                      # Directory header (lines ending with ':')
                      if line.endswith(":"):
                          header = line[:-1]
                          # If header is the root, set current_path to empty list
                          if header == ".":
                              current_path = []
                          else:
                              # Remove the "./" prefix if present, then split path by "/"
                              if header.startswith("./"):
                                  header = header[2:]
                              current_path = header.split("/")
                          continue
                      # The ls output is in the format: permissions, link count, owner, group,
                      # size, month, day, time (or year), filename.
                      # Since filenames may contain spaces, join fields from the 9th element onward.
                      parts = line.split()
                      if len(parts) < 9:
                          continue  # Skip unexpected format lines
                      filename = " ".join(parts[8:])
                      # If the first field starts with 'd', it is a directory.
                      is_dir = parts[0].startswith("d")
                      # Navigate through the tree based on the current path.
                      node = tree
                      for part in current_path:
                          node = node.setdefault(part, {})
                      # Add the file or directory entry.
                      if is_dir:
                          node[filename] = {}
                      else:
                          node[filename] = None
              return tree
          if __name__ == "__main__":
              input_file = "ensembl-beta_ftp_structure.txt"  # File generated by lftp
              output_file = "ensembl-beta_ftp_structure.json"
              directory_tree = parse_ls_output(input_file)
              with open(output_file, "w") as out_f:
                  json.dump(directory_tree, out_f, indent=4)
              print(f"Directory structure saved to {output_file}.")
          EOF
      - name: Generate information file
        run: |
          python3 << 'EOF'
          import json
          # Load the JSON file containing the folder structure.
          with open("ensembl-beta_ftp_structure.json", "r") as f:
              data = json.load(f)
          # Prepare the header and rows.
          ls_species = ["Species\tGenome"]
          ls_total = ["Species\tGenome"]
          species_num = 0
          genome_num = 0
          for species, genome in data.items():
              # Skip the "test" folder
              if species == "test":
                  continue
              # Count genome IDs that start with "GCA_" or "GCF_"
              count = sum(1 for genome_id in genome.keys() if genome_id.startswith("GCA_") or genome_id.startswith("GCF_"))
              ls_species.append(f"{species}\t{count}")
              species_num += 1
              genome_num += count
          # Total number of species and genomes
          ls_total.append(f"{species_num}\t{genome_num}")
          # Write output tables
          with open("ensembl-beta_count_species.txt", "w") as out_file:
              out_file.write("\n".join(ls_species))
          with open("ensembl-beta_count_total.txt", "w") as out_file:
              out_file.write("\n".join(ls_total))
          EOF
      - name: Commit and push changes
        run: |
          git config --global user.name "keun-hong"
          git config --global user.email "thsrms9216@gmail.com"
          if [ -n "$(git status --porcelain ensembl-beta_ftp_structure.json ensembl-beta_count_species.txt ensembl-beta_count_total.txt)" ]; then
            git add ensembl-beta_ftp_structure.txt ensembl-beta_ftp_structure.json ensembl-beta_count_species.txt ensembl-beta_count_total.txt
            git commit -m "Update FTP folder structure and information"
            # persist-credentials is false above, so this push needs the PAT from step 5
            git push
          else
            echo "No changes to commit."
          fi
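To see what the two Python steps in the workflow produce, it can help to run the same logic locally on a tiny listing. The sketch below is a condensed variant of the workflow's parser and counter, not a replacement for them. The sample listing, the species and accession names, and the function names parse_ls_lines and count_genomes are all invented for illustration, and the listing only mimics the long "ls -R" format the parser expects.

```python
import json

# Hypothetical listing in the long "ls -R" format the parser expects.
# All species and accession names below are invented for illustration.
SAMPLE_LISTING = """\
.:
drwxr-xr-x   4 ftp ftp 4096 Jan 10 12:00 Homo_sapiens
drwxr-xr-x   3 ftp ftp 4096 Jan 10 12:00 Mus_musculus
drwxr-xr-x   2 ftp ftp 4096 Jan 10 12:00 test

./Homo_sapiens:
drwxr-xr-x   2 ftp ftp 4096 Jan 10 12:00 GCA_000000001.1
drwxr-xr-x   2 ftp ftp 4096 Jan 10 12:00 GCF_000000002.1

./Homo_sapiens/GCA_000000001.1:
-rw-r--r--   1 ftp ftp 1234 Jan 10 12:00 README.txt

./Mus_musculus:
drwxr-xr-x   2 ftp ftp 4096 Jan 10 12:00 GCA_000000003.1
-rw-r--r--   1 ftp ftp   99 Jan 10 12:00 notes.txt
"""


def parse_ls_lines(lines):
    """Build a nested dict from "ls -R" lines: directories map to dicts, files to None."""
    tree = {}
    current_path = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith(":"):
            # Directory header such as "./Homo_sapiens:" sets the current location.
            header = line[:-1]
            if header == ".":
                current_path = []
            else:
                if header.startswith("./"):
                    header = header[2:]
                current_path = header.split("/")
            continue
        parts = line.split()
        if len(parts) < 9:
            continue  # Not a long-format entry line
        filename = " ".join(parts[8:])
        node = tree
        for part in current_path:
            node = node.setdefault(part, {})
        node[filename] = {} if parts[0].startswith("d") else None
    return tree


def count_genomes(tree, skip=("test",)):
    """Count GCA_/GCF_ entries under each top-level species folder."""
    counts = {}
    for species, genomes in tree.items():
        if species in skip:
            continue
        counts[species] = sum(
            1 for name in genomes if name.startswith(("GCA_", "GCF_"))
        )
    return counts


tree = parse_ls_lines(SAMPLE_LISTING.splitlines())
counts = count_genomes(tree)
print(json.dumps(tree, indent=4))
print(counts)
```

On this sample, the "test" folder is skipped and notes.txt is ignored because its name does not start with “GCA_” or “GCF_”, which matches the behavior of the workflow's counting step.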
5. Use a Personal Access Token (PAT)
Steps to Create and Register Your PAT
- Generate a PAT:
  - Go to the GitHub Personal Access Token creation page and generate a new token with the required scopes (e.g., repo).
- Copy the Token:
  - Once generated, copy the token value.
- Add the PAT as a Secret in Your Repository:
  - Go to your repository.
  - Click the “Settings” tab.
  - In the left sidebar, click “Secrets and variables” then “Actions”.
  - Click “New repository secret”.
  - Enter MY_PAT in the Name field and paste your token in the Value field.
  - Click “Add secret”.
- Update the Workflow’s Push Command:
  - In the workflow file, the push command must authenticate with the secret, which a workflow references as ${{ secrets.MY_PAT }}. This is needed because the checkout step sets persist-credentials: false, so a plain git push has no stored credentials.
6. Commit and Push Your Workflow File
Save the workflow file and push it to your repository:
git add .github/workflows/ensembl-beta-ftp-structure.yml
git commit -m "Add workflow to update FTP folder structure and information"
git push
7. Verify the Workflow
- Navigate to your repository on GitHub.
- Click the “Actions” tab. You should see your “Update FTP Folder Structure” workflow listed.
- You can manually trigger the workflow using the “Run workflow” button or wait for the scheduled run.
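If you are unsure when the scheduled run will fire, remember that the cron expression '0 15 * * *' means once a day at 15:00 UTC. The sketch below computes the next such time from a given moment. The function name next_daily_run is hypothetical, and it only handles this simple "fixed hour and minute every day" pattern, not cron syntax in general. GitHub also notes that scheduled runs can be delayed when load is high, so treat the result as the earliest expected start.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=15, minute=0):
    """Return the next UTC time matching the cron pattern '<minute> <hour> * * *'.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


upcoming = next_daily_run(datetime.now(timezone.utc))
print("Next scheduled run (UTC):", upcoming.isoformat())
```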
8. Access the Output
After the workflow runs, the following files will be generated in your repository:
- ensembl-beta_ftp_structure.txt: The raw output from the lftp command.
- ensembl-beta_ftp_structure.json: The JSON file with the parsed folder structure.
- ensembl-beta_count_species.txt: A tab-separated table containing the species name and the number of genome IDs.
- ensembl-beta_count_total.txt: A tab-separated table containing the total number of species and genome IDs.
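Because the count tables are plain tab-separated text with a "Species" and "Genome" header, they are easy to load in a downstream script. Here is a minimal sketch that parses the per-species table and recomputes the totals. The sample rows and the function name read_counts are invented for illustration; in practice you would read the contents of ensembl-beta_count_species.txt instead of the inline string.

```python
import csv
import io

# Hypothetical contents in the same layout as ensembl-beta_count_species.txt.
SAMPLE_SPECIES_TABLE = "Species\tGenome\nHomo_sapiens\t2\nMus_musculus\t1"


def read_counts(text):
    """Parse the tab-separated species table into {species: genome_count}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Species"]: int(row["Genome"]) for row in reader}


species_counts = read_counts(SAMPLE_SPECIES_TABLE)
species_num = len(species_counts)
genome_num = sum(species_counts.values())
print(f"{species_num} species, {genome_num} genomes")
```

The two recomputed totals should agree with the single data row in ensembl-beta_count_total.txt, which gives a quick consistency check on the workflow's output.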
If your repository is public, these files are accessible to anyone at any time.
By following these steps, you’ve successfully set up GitHub Actions to automatically and regularly update your FTP folder structure data and generate useful summary information. Enjoy exploring the creative uses of GitHub Actions beyond traditional CI/CD tasks!