
In this post, I explain how I used GitHub Actions to automate a recurring task: regularly updating file and folder structure information from the ENSEMBL Beta FTP server. I needed the current structure so I could count the number of species (the top-level folder names) and, for each species, the number of genomes (subfolders whose names start with “GCA_” or “GCF_”). The workflow writes the counts to tab-separated text files. Below is a step-by-step guide to the process.
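
As a rough illustration of the counting described above, here is a minimal Python sketch that works on a small nested dictionary. The sample tree, its folder names, and the function name count_genomes are invented for this example; the workflow in step 4 builds its tree from the real FTP listing.

```python
def count_genomes(tree):
    """Return {species: genome_count} for a nested dict of folders.

    Folders are dicts, files are None (the convention used in this sketch).
    """
    counts = {}
    for species, children in tree.items():
        if not isinstance(children, dict):
            continue  # a top-level file, not a species folder
        counts[species] = sum(
            1 for name in children if name.startswith(("GCA_", "GCF_"))
        )
    return counts


# Hypothetical folder names, for illustration only.
sample_tree = {
    "Species_a": {"GCA_000000001.1": {}, "GCF_000000002.1": {}},
    "Species_b": {"GCA_000000003.1": {}, "notes": {}},
    "README": None,
}

counts = count_genomes(sample_tree)
print(counts)                             # per-species genome counts
print(len(counts), sum(counts.values()))  # total species, total genomes
```

For the sample tree, Species_a has two genomes and Species_b has one; the "notes" folder and the README file are ignored.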

Reference: GitHub - snu-cdrc/gencube/.github/workflows

1. Create a GitHub Repository

  • Go to GitHub and click “New repository”.
  • Give your repository a name (e.g., ensembl-beta-ftp-structure) and choose whether to make it public or private.
  • Click “Create repository”.

2. Clone the Repository Locally

If you’re comfortable working with Git locally, you can clone the repository using your preferred method (e.g., via the command line):

git clone https://github.com/yourusername/ensembl-beta-ftp-structure.git
cd ensembl-beta-ftp-structure

3. Set Up the GitHub Actions Directory

GitHub Actions workflows must reside in a specific folder. Create the following folder structure in the root of your repository:

ensembl-beta-ftp-structure/
└── .github/
    └── workflows/

4. Create the Workflow File

Inside the workflows folder, create a new file called ensembl-beta-ftp-structure.yml. Open the file in your favorite text editor and paste the following YAML content.

name: Update FTP Folder Structure

on:
  schedule:
    # Runs once a day at 15:00 UTC
    - cron: '0 15 * * *'
  workflow_dispatch:

jobs:
  update_structure:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout repository
      uses: actions/checkout@v3
      with:
        persist-credentials: false

    - name: Install dependencies
      run: |
        sudo apt-get update
        sudo apt-get install -y lftp
        pip install --upgrade pip

    - name: Fetch FTP folder structure using ls -R
      env:
        FTP_SERVER: "ftp.ebi.ac.uk"  # Replace with your FTP server if needed
        FTP_PATH: "/pub/ensemblorganisms"  # Replace with desired FTP path
      run: |
        lftp -c "open $FTP_SERVER; cd $FTP_PATH; ls -R" > ensembl-beta_ftp_structure.txt

    - name: Convert structure to JSON
      run: |
        python3 << 'EOF'
        import json

        def parse_ls_output(file_path):
            tree = {}
            current_path = []  # Current directory path (list)

            with open(file_path, "r") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    # Directory header (lines ending with ':')
                    if line.endswith(":"):
                        header = line[:-1]
                        # If header is the root, set current_path to empty list
                        if header == ".":
                            current_path = []
                        else:
                            # Remove the "./" prefix if present, then split path by "/"
                            if header.startswith("./"):
                                header = header[2:]
                            current_path = header.split("/")
                        continue

                    # Long-format ls lines have 9+ fields: permissions, link count, owner, group,
                    # size, month, day, time (or year), filename.
                    # Since filenames may contain spaces, join fields from the 9th element onward.
                    parts = line.split()
                    if len(parts) < 9:
                        continue  # Skip unexpected format lines
                    filename = " ".join(parts[8:])
                    # If the first field starts with 'd', it is a directory.
                    is_dir = parts[0].startswith("d")

                    # Navigate through the tree based on the current path.
                    node = tree
                    for part in current_path:
                        node = node.setdefault(part, {})
                    # Add the file or directory entry.
                    if is_dir:
                        node[filename] = {}
                    else:
                        node[filename] = None
            return tree

        if __name__ == "__main__":
            input_file = "ensembl-beta_ftp_structure.txt"  # File generated by lftp
            output_file = "ensembl-beta_ftp_structure.json"

            directory_tree = parse_ls_output(input_file)

            with open(output_file, "w") as out_f:
                json.dump(directory_tree, out_f, indent=4)

            print(f"Directory structure saved to {output_file}.")
        EOF

    - name: Generate information file
      run: |
        python3 << 'EOF'
        import json

        # Load the JSON file containing the folder structure.
        with open("ensembl-beta_ftp_structure.json", "r") as f:
            data = json.load(f)

        # Prepare the header and rows.
        ls_species = ["Species\tGenome"]
        ls_total = ["Species\tGenome"]
        species_num = 0
        genome_num = 0

        for species, genome in data.items():
            # Skip the "test" folder and any top-level files (stored as None, not a dict)
            if species == "test" or not isinstance(genome, dict):
                continue
            # Count genome IDs that start with "GCA_" or "GCF_"
            count = sum(1 for genome_id in genome if genome_id.startswith(("GCA_", "GCF_")))
            ls_species.append(f"{species}\t{count}")

            species_num += 1
            genome_num += count
        # Total number of species and genomes
        ls_total.append(f"{species_num}\t{genome_num}")

        # Write output tables
        with open("ensembl-beta_count_species.txt", "w") as out_file:
            out_file.write("\n".join(ls_species))
        with open("ensembl-beta_count_total.txt", "w") as out_file:
            out_file.write("\n".join(ls_total))
        EOF

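    - name: Commit and push changes
      env:
        MY_PAT: ${{ secrets.MY_PAT }}  # PAT stored as a repository secret (see step 5)
      run: |
        # Replace with your own Git identity
        git config --global user.name "keun-hong"
        git config --global user.email "thsrms9216@gmail.com"
        if [ -n "$(git status --porcelain ensembl-beta_ftp_structure.json ensembl-beta_count_species.txt ensembl-beta_count_total.txt)" ]; then
          git add ensembl-beta_ftp_structure.txt ensembl-beta_ftp_structure.json ensembl-beta_count_species.txt ensembl-beta_count_total.txt
          git commit -m "Update FTP folder structure and information"
          # Checkout ran with persist-credentials: false, so the push must carry the PAT.
          # This URL form is one common way to do that, not the only one.
          git push "https://x-access-token:${MY_PAT}@github.com/${GITHUB_REPOSITORY}.git" "HEAD:${GITHUB_REF_NAME}"
        else
          echo "No changes to commit."
        fi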
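
The parsing step is easier to follow with a concrete listing in hand. The sketch below is a compact variant of the workflow's parser that reads from a string instead of a file, so it can be tried locally without touching the FTP server. The sample listing is made up (its names, sizes, and dates are placeholders), and the function name parse_listing is an assumption of this example; the real server output may differ in detail.

```python
import json

# Hypothetical `ls -R` output in long format; not copied from the real server.
SAMPLE_LISTING = """\
.:
drwxr-xr-x    3 ftp      ftp          4096 Jan 10 12:00 Species_a
-rw-r--r--    1 ftp      ftp           120 Jan 10 12:00 README

./Species_a:
drwxr-xr-x    2 ftp      ftp          4096 Jan 10 12:00 GCA_000000001.1

./Species_a/GCA_000000001.1:
-rw-r--r--    1 ftp      ftp          2048 Jan 10 12:00 genome file.fa
"""


def parse_listing(text):
    """Turn recursive long-format ls output into a nested dict.

    Directories become dicts and files become None.
    """
    tree = {}
    current_path = []
    for line in text.splitlines():
        if not line:
            continue
        if line.endswith(":"):  # directory header such as "./Species_a:"
            header = line[:-1]
            if header.startswith("./"):
                header = header[2:]
            current_path = [] if header == "." else header.split("/")
            continue
        parts = line.split()
        if len(parts) < 9:
            continue  # not a long-format entry
        name = " ".join(parts[8:])  # filenames may contain spaces
        node = tree
        for part in current_path:
            node = node.setdefault(part, {})
        node[name] = {} if parts[0].startswith("d") else None
    return tree


tree = parse_listing(SAMPLE_LISTING)
print(json.dumps(tree, indent=2))
```

The printed JSON should show Species_a containing GCA_000000001.1, which in turn contains the file "genome file.fa" with a null value, alongside a top-level README that is also null.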

5. Use a Personal Access Token (PAT)

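Steps to Create and Register Your PAT

  1. Generate a PAT:

    • On GitHub, open your account “Settings”, then “Developer settings”, then “Personal access tokens”, and generate a new token that is allowed to push to this repository (for a classic token, the repo scope; for a fine-grained token, read and write access to Contents).
  2. Copy the Token:

    • Once generated, copy the token value.
  3. Add the PAT as a Secret in Your Repository:

    • Go to your repository.
    • Click the “Settings” tab.
    • In the left sidebar, click “Secrets and variables” then “Actions”.
    • Click “New repository secret”.
    • Enter MY_PAT in the Name field and paste your token in the Value field.
    • Click “Add secret”.
  4. Update the Workflow’s Push Command:

    • In the workflow file, the push command must authenticate with this secret, referenced as ${{ secrets.MY_PAT }}, because the checkout step sets persist-credentials: false and a plain git push would have no credentials.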

6. Commit and Push Your Workflow File

Save the workflow file and push it to your repository:

git add .github/workflows/ensembl-beta-ftp-structure.yml
git commit -m "Add workflow to update FTP folder structure and information"
git push

7. Verify the Workflow

  • Navigate to your repository on GitHub.
  • Click the “Actions” tab. You should see your “Update FTP Folder Structure” workflow listed.
  • You can manually trigger the workflow using the “Run workflow” button or wait for the scheduled run.
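
If you choose to wait, it helps to know when the schedule fires next. GitHub evaluates cron schedules in UTC, so the expression '0 15 * * *' in the workflow means 15:00 UTC each day. The sketch below uses only the Python standard library to compute the next such time. The function name next_daily_run is hypothetical, and scheduled runs on GitHub can start later than the nominal time.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now, hour=15, minute=0):
    """Return the next daily occurrence of hour:minute UTC strictly after `now`."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(
        now_utc.date(), time(hour, minute), tzinfo=timezone.utc
    )
    if candidate <= now_utc:
        candidate += timedelta(days=1)  # today's slot has already passed
    return candidate


now = datetime.now(timezone.utc)
run_at = next_daily_run(now)
print("Next scheduled run (UTC):  ", run_at.isoformat())
print("Next scheduled run (local):", run_at.astimezone().isoformat())
```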

8. Access the Output

After the workflow runs, the following files will be generated in your repository:

  • ensembl-beta_ftp_structure.txt: The raw output from the lftp command.
  • ensembl-beta_ftp_structure.json: The JSON file with the parsed folder structure.
  • ensembl-beta_count_species.txt: A tab-separated table containing the species name and the number of genome IDs.
  • ensembl-beta_count_total.txt: A tab-separated table containing the total number of species and genome IDs.

If your repository is public, these files are accessible to anyone at any time.
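
To use the tables downstream, a minimal sketch like the following reads the per-species table and recomputes the totals. It assumes the two-column, tab-separated layout with a "Species" and "Genome" header row that the workflow writes; the helper name read_counts and the inline sample rows are invented for illustration.

```python
import csv
import io


def read_counts(handle):
    """Read a Species/Genome tab-separated table into {species: count}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Species"]: int(row["Genome"]) for row in reader}


# Inline sample with hypothetical species names, standing in for the real file.
sample = io.StringIO("Species\tGenome\nSpecies_a\t2\nSpecies_b\t5")
counts = read_counts(sample)
print(len(counts), sum(counts.values()))  # number of species, number of genomes
```

With a local copy of the repository, the same helper can be pointed at the real file by opening ensembl-beta_count_species.txt with open(..., newline="") and passing the handle to read_counts.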


By following these steps, you’ve successfully set up GitHub Actions to automatically and regularly update your FTP folder structure data and generate useful summary information. Enjoy exploring the creative uses of GitHub Actions beyond traditional CI/CD tasks!
