Team Pickle Rick

GitHub Repository: https://github.com/evan-mcginnis/INFO529-midterm

Docker (Container): https://hub.docker.com/repository/docker/emcginnis/info529

GitHub Website/Write-up: https://ianforsyth1.github.io/ISTA429-Midterm/

Google Document Answering the 10 Simple Rules to Cultivate Transdisciplinary Collaboration in Data Science: https://docs.google.com/document/d/1SoN3xQI8gXD-f0s9SmePk9jtNsvu5lT1yyz4gLLYuDA/edit?usp=sharing

Presentation Slide Show: https://docs.google.com/presentation/d/1fhYI1ss_TzTWNZWb8D3U0waaf5MbdILuLbBC8FQUSTg/edit?usp=sharing

Members: Brendan Bertone, Jocelyn Connors, Evan McGinnis, Hongyuan Zhang, Ian Forsyth

Team Spokesman: Ian Forsyth

Steps Taken to Complete the Competition

  • Created a GitHub Pages website to host the report for the competition.

  • Cloned/forked the GitHub repository and used git pulls, commits, and an upstream remote so everyone was always working from the most current version of the master repository.

  • Manipulated/cleaned the data, stored it on CyVerse, and accessed it from the HPC.

  • Created a summary Python script that graphed and modeled the features of the data.

  • Connected to the HPC and ran code in interactive mode.

    • Ran code both with and without a GPU. Increasing the memory requested in interactive mode was necessary; otherwise our prediction-yield Python script would be killed.

      • CPU-only interactive mode was initiated by specifying: srun --nodes=1 --mem=470GB --ntasks=1 --cpus-per-task=94 --time=6:00:00 --job-name=po_mcomp --account=lyons-lab --partition=standard --mpi=pmi2 --pty bash -i

      • The GPU variant added --gres=gpu:1: srun --gres=gpu:1 --nodes=1 --mem=470GB --ntasks=1 --cpus-per-task=94 --time=6:00:00 --job-name=po_mcomp --account=lyons-lab --partition=standard --mpi=pmi2 --pty bash -i

    • Used conda for our virtual environment, activating it with 'conda activate midterm' once in interactive mode.

  • Created a batch file that ran a SLURM job on the HPC to predict results for the training data using our prediction-yield Python script. The script also wrote the predictions for the training data to a NumPy file.

  • Modified the batch file with vi to request 16 nodes, 12 CPUs per task, and a time limit of 10:00:00.

  • Submitted the job on the HPC with sbatch.

  • Used a Makefile so that the shell commands for specific tasks were easier to invoke from the command line (e.g., make data, make train-data).

  • Produced prediction results for the training data with decent RMSE and R-squared values.

  • Incorporated Docker containers for running our code and connected the container to the HPC.
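
The fork-and-sync git workflow in the steps above can be sketched with a local stand-in repository (the temporary bare repository here plays the role of the shared master repo on GitHub; names and the commit message are illustrative):

```shell
set -eu
repo=$(mktemp -d)
# a local stand-in for the shared master repository
git init -q --bare "$repo/upstream.git"
# each member works from their own clone of the master repository
git clone -q "$repo/upstream.git" "$repo/member" 2>/dev/null
# commit local work, push it up, then pull so the clone matches the master repo
git -C "$repo/member" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "initial work"
git -C "$repo/member" push -q origin HEAD
git -C "$repo/member" pull -q origin "$(git -C "$repo/member" branch --show-current)"
git -C "$repo/member" log --oneline > git_log.txt
cat git_log.txt
```

In practice the pull happens before each work session so every member starts from the latest commit.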
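
The batch file described above might look like the following sketch. The #SBATCH values match the edited settings (16 nodes, 12 CPUs per task, 10:00:00 time limit) and the job-name/account/partition values come from the srun commands; the script and file names are assumptions:

```shell
# write a SLURM batch script like the one described (names are illustrative)
cat > predict_yield.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=po_mcomp
#SBATCH --account=lyons-lab
#SBATCH --partition=standard
#SBATCH --nodes=16
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --time=10:00:00

conda activate midterm
python prediction-yield.py    # also writes the prediction NumPy file
EOF
# sbatch predict_yield.sbatch  # submit the job to the scheduler
echo "wrote $(grep -c '^#SBATCH' predict_yield.sbatch) SBATCH directives"
```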
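
A Makefile with targets like the ones named above could be as simple as the sketch below; the target names come from the list, but the recipe commands are placeholders for the team's actual scripts:

```shell
# hypothetical Makefile with the targets mentioned above (recipes are placeholders)
mkdir -p make_demo
printf 'data:\n\tpython summary.py\n\ntrain-data:\n\tpython prediction-yield.py\n' \
    > make_demo/Makefile
# -n prints the commands a target would run without executing them
make -C make_demo -n data || cat make_demo/Makefile
```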
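
The RMSE and R-squared metrics mentioned above can be computed from (actual, predicted) pairs with a one-pass awk script; the yield values here are toy numbers, not the team's real predictions:

```shell
# hypothetical (actual, predicted) yield pairs
cat > pairs.txt <<'EOF'
10 12
20 18
30 33
EOF
# RMSE = sqrt(mean squared error); R2 = 1 - SSE / total sum of squares
awk '{ n++; e = $2 - $1; sse += e*e; sy += $1; syy += $1*$1 }
     END { mean = sy / n
           printf "RMSE=%.3f R2=%.3f\n", sqrt(sse/n), 1 - sse/(syy - n*mean*mean) }' \
    pairs.txt > metrics.txt
cat metrics.txt    # -> RMSE=2.380 R2=0.915
```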
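
Connecting the Docker container to the HPC would typically go through Apptainer/Singularity, since HPC clusters generally do not run Docker itself. The image name comes from the Docker Hub link above, but the tag and script name are assumptions, and the pull/exec lines are left commented because they require the cluster's container runtime:

```shell
# sketch: run the team's Docker Hub image on the HPC via Apptainer (tag assumed)
IMAGE="docker://emcginnis/info529:latest"
# apptainer pull info529.sif "$IMAGE"                     # convert image to a .sif file
# apptainer exec info529.sif python prediction-yield.py   # run the code inside it
echo "would pull: $IMAGE" > container_steps.txt
cat container_steps.txt
```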

Member Skills

Evan McGinnis (Overall Project Manager/Data Engineer/Coder): Python, Various DBs, Project Management, AWS, Azure, Docker

Brendan Bertone (HPC Team): Information Technology Major/CS Minor, Data Mining Techniques (R, Tidyverse), Python/Java/HTML/CSS/JavaScript, Cyber Security Concepts, Networking Concepts

Jocelyn Connors (HPC Team): R, SQL, experience in Microsoft Azure and MS Access, Excel, PowerPoint, Basic Python, Data Mining Techniques (R, Tidyverse), Data Visualization

Hongyuan Zhang (Write-Up/GitHub Website Team): HTML, Python, C, basic R programming, Java, Linux

Ian Forsyth (Write-Up/GitHub Website Team): Information Technology (enterprise setting), some Python and R programming, some HTML/CSS, some SQL, Cyber Security Concepts, MS Office Suite.

Gaps in Knowledge

Plant Science, Possibly Machine Learning