Team Pickle Rick
GitHub Repository: https://github.com/evan-mcginnis/INFO529-midterm
Docker (Container): https://hub.docker.com/repository/docker/emcginnis/info529
GitHub Website/Write-up: https://ianforsyth1.github.io/ISTA429-Midterm/
Google Document Answering the 10 Simple Rules to Cultivate Transdisciplinary Collaboration in Data Science: https://docs.google.com/document/d/1SoN3xQI8gXD-f0s9SmePk9jtNsvu5lT1yyz4gLLYuDA/edit?usp=sharing
Presentation Slide Show: https://docs.google.com/presentation/d/1fhYI1ss_TzTWNZWb8D3U0waaf5MbdILuLbBC8FQUSTg/edit?usp=sharing
Members: Brendan Bertone, Jocelyn Connors, Evan McGinnis, Hongyuan Zhang, Ian Forsyth
Team Spokesman: Ian Forsyth
Steps Taken to Complete the Competition
Created a GitHub Pages website to host the report for the competition.
Cloned/forked the GitHub repository and used git pull, commit, and push (with upstream tracking) so everyone was always working on the most current version of the master branch.
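That workflow can be sketched roughly as follows (the remote name "upstream" and the fork URL are assumptions for illustration):

```bash
# Clone your fork of the team repository (fork URL is hypothetical)
git clone https://github.com/<your-username>/INFO529-midterm.git
cd INFO529-midterm

# Track the team repository as "upstream"
git remote add upstream https://github.com/evan-mcginnis/INFO529-midterm.git

# Before starting work, sync with the most current master
git pull upstream master

# Commit changes and push them back to your fork
git add .
git commit -m "Describe the change"
git push origin master
```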
Manipulated/cleaned the data, stored it on CyVerse, and accessed it from the HPC.
Created a summary Python script to graph and model the features of the data.
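A minimal sketch of the kind of per-feature summary that script computed (the feature names and values below are made up for illustration, not the competition data):

```python
import statistics

def summarize(features):
    """Return (mean, sample standard deviation) for each feature column."""
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in features.items()
    }

# Hypothetical feature columns, not the actual competition data
features = {
    "rainfall": [10.0, 12.0, 11.0, 13.0],
    "temperature": [20.0, 22.0, 21.0, 23.0],
}

summary = summarize(features)
print(summary["rainfall"])  # (mean, stdev) for the rainfall column
```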
Connected to the HPC and ran code in interactive mode.
Ran code both with and without a GPU. We had to increase the memory allocation in interactive mode or our prediction-yield Python script would be killed.
Interactive mode was initiated by specifying the following (the second command adds --gres=gpu:1 to request a GPU):
srun --nodes=1 --mem=470GB --ntasks=1 --cpus-per-task=94 --time=6:00:00 --job-name=po_mcomp --account=lyons-lab --partition=standard --mpi=pmi2 --pty bash -i
srun --gres=gpu:1 --nodes=1 --mem=470GB --ntasks=1 --cpus-per-task=94 --time=6:00:00 --job-name=po_mcomp --account=lyons-lab --partition=standard --mpi=pmi2 --pty bash -i
Used conda for our virtual environment, activating it with the command 'conda activate midterm' once in interactive mode.
Created a batch file to run a SLURM job on the HPC that predicted results for the training data using our prediction-yield Python script. The script also wrote the training-data predictions out as a NumPy file.
Edited the batch file with vi to request 16 nodes, 12 CPUs per task, and a time limit of 10:00:00.
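With those settings, the batch file would look roughly like this (the job-name, account, and partition values are taken from the srun commands above; the script file name is an assumption):

```bash
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --cpus-per-task=12
#SBATCH --time=10:00:00
#SBATCH --job-name=po_mcomp
#SBATCH --account=lyons-lab
#SBATCH --partition=standard

# Activate the conda environment, then run the prediction script
conda activate midterm
python prediction-yield.py
```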
Submitted the job on the HPC with sbatch.
Used a Makefile so that common shell commands could be invoked from the command line more easily (e.g., make data, make train-data).
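The Makefile targets mentioned above might look something like this (the recipes here are assumptions based on the step names, not our actual Makefile):

```make
# Hypothetical recipes for illustration only
data:
	python clean-data.py

train-data:
	python prediction-yield.py
```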
Produced prediction results for the training data with decent RMSE and R-squared values.
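For reference, RMSE and R-squared can be computed from predictions like this (the numbers below are illustrative, not our actual scores):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```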
Incorporated Docker containers for running our code and connected the container to the HPC.
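Connecting a Docker Hub image to an HPC is typically done through Singularity/Apptainer, since Docker itself usually isn't available on shared clusters. A rough sketch, assuming the image name from the Docker Hub link above and a :latest tag:

```bash
# Pull and run locally with Docker (tag is an assumption)
docker pull emcginnis/info529:latest
docker run --rm emcginnis/info529:latest python prediction-yield.py

# On the HPC, convert the Docker image to a Singularity image and run it
singularity pull docker://emcginnis/info529:latest
singularity exec info529_latest.sif python prediction-yield.py
```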
Member Skills
Evan McGinnis (Overall Project Manager/Data Engineer/Coder): Python, Various DBs, Project Management, AWS, Azure, Docker
Brendan Bertone (HPC Team): Information Technology Major/CS Minor, Data Mining Techniques (R, Tidyverse), Python/Java/HTML/CSS/JavaScript, Cyber Security Concepts, Networking Concepts
Jocelyn Connors (HPC Team): R, SQL, experience in Microsoft Azure and MS Access, Excel, PowerPoint, Basic Python, Data Mining Techniques (R, Tidyverse), Data Visualization
Hongyuan Zhang (Write Up/GitHub Website Team): HTML, Python, C, basic R programming, Java, Linux
Ian Forsyth (Write Up/Github Website Team): Information Technology (Enterprise setting), Some Python and R programming, Some HTML/CSS, Some SQL, Cyber Security Concepts, MS Office Suite.
Gaps in Knowledge
Plant Science, Possibly Machine Learning