AGU_Fall2016

Monday AM 0800-1000

Architecture and integration testbed for Cyberinfrastructure

Initial attendance: low (40-50 people).

1st Speaker: Sara Graves (U Alabama) - EarthCube ECITE (EarthCube Integration and Testing Environment). 

  • Long text slides

    • Infrastructure for test and evaluation

    • Easily accessible computational resources.

    • Promote collaboration and participation. 

  • Need for a testbed environment (2014)

  • EarthCube 'Workbench'

    • sounds like a DE

  • Open Science Framework

  • Amazon, OpenStack, Eucalyptus

  • sites.cloud.gmu.edu/ecite

  • ecite.dc2.gmu.edu/dc2us2/login.jsp

Attendance picking up (75-100).

2nd Speaker: Scott Farley (U Wisconsin).

  • Climate and big data

  • Environmental covariates for Species Distribution Modelling (SDM).

  • Neotoma = 1.5 million records since 2010

  • Random Forests (RF), Boosted Regression Trees, Generalized Additive Models; presence/absence models (see the sketch after this list). 

  • Bayesian Additive Regression Trees

  • Cloud computing using Amazon Web Services. 

  • Choosing optimal hardware for an SDM.

  • NSF EAR 150707
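
A minimal sketch of the presence/absence modelling idea mentioned above, using scikit-learn's random forest on a hypothetical occurrence table; the CSV file and column names are my assumptions, not from the talk.

    # Presence/absence SDM sketch (illustrative, not Farley's code).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Hypothetical table: one row per site, climate covariates + presence flag.
    df = pd.read_csv("occurrences.csv")                              # assumed file
    covariates = ["mean_annual_temp", "annual_precip", "elevation"]  # assumed columns
    X, y = df[covariates], df["presence"]                            # 1 = presence, 0 = absence

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X_train, y_train)

    # AUC is a common skill metric for presence/absence SDMs.
    print("AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))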

3rd Speaker: Peishi Jiang (UIUC)

4th Speaker: Daniel Duffy (NASA Goddard Space Flight Center)

  • NASA Center for Climate Simulation (NCCS)

  • GEOS, GMAO

  • 27 km resolution, 27 million grid points

  • 12 km resolution

  • Exabyte-size datasets when we move to 100-200 m resolution.

  • HPC Science cloud

  • Data Analytics Storage Service (DASS) (~20PB)

  • CAaaS (Climate Analytics as a Service)

  • MERRA Analytic Services doi: 10.1016/j.compenvurbsys.2013

  • Create a true centralized combination of storage and compute capacity.

  • a new 20 PB system costs ~$1 million

  • Hadoop ecosystem.

  • Apollo 4520: 28 servers, 896 cores, 14 TB memory, 16 GB/core, 37 TF compute

  • MapReduce, Spark, machine learning, RESTful services, Cloudera and SIA, General Parallel File System (GPFS) and Hadoop connector.

  • CentOS

  • Software RAID

  • Linux Storage

  • SIA = spatiotemporal indexing approach, used with Hadoop

  • Zhenlong Li et al. 2016, "A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce"

  • Native NetCDF data.

  • Lustre did not perform as well as Hadoop/Cloudera on the MapReduce query; results were otherwise similar. 

  • Serial performance w/ SIA: GPFS with MapReduce is fastest (5x speed improvement over Hadoop) w/ native NetCDF

  • Parallel performance w/ SIA: similar results, 2x improvement (a sketch of the partial-read idea behind SIA follows this list).
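
The core idea behind SIA, as I understood it, is that a spatiotemporal index lets each task read only the array hyperslab it needs instead of the whole file. A rough sketch of that partial-read pattern with netCDF4 is below; the file and variable names are assumptions, not the NCCS setup.

    # Read only the (time, lat, lon) hyperslab a task needs (illustrative).
    from netCDF4 import Dataset
    import numpy as np

    ds = Dataset("merra_t2m.nc4")            # assumed MERRA-like file
    t2m = ds.variables["T2M"]                # (time, lat, lon) array

    # A hypothetical task covering one day and one spatial tile; the index
    # maps the tile to array offsets so the read touches only those bytes.
    time_slice = slice(0, 24)                            # 24 hourly steps
    lat_slice, lon_slice = slice(100, 200), slice(300, 400)

    tile = t2m[time_slice, lat_slice, lon_slice]         # partial read, not the full array
    print("tile mean:", float(np.asarray(tile).mean()))
    ds.close()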

5th Speaker: Charuleka Varadharajan (Lawrence Berkeley National Lab)

  • Management and assimilation of diverse distributed watershed datasets.

  • How do mountain watersheds release water, nutrients, carbon and metals?

  • Agile data approach for user-centred data.

  • Three databases for project:

    • ifrcrifle.org

    • easteriver.dev.subsurfaceinsights.com

    • ggkbase.berkeley.edu

  • Basin-3D, Python, Django, Bitbucket.

  • watershed.lbl.gov

6th Speaker: Zhenlong Li

  • Large-scale parallel processing of lidar data

  • Domain-based vs. tile-based

  • Standard lidar slide - big data, billions of points, ArcGIS sucks.

  • Distributed storage and computing cluster (Hadoop) 

  • maximize load balancing, parallelization and data locality

  • Tile-based structure for all of SC.

  • HDFS distributes tiles across nodes by spatial index (geographic-space reference); see the tiling sketch after this list.

    • Seems like a job for cctools

    • gis.cas.sc.edu/lidar

    • Hadoop cluster v. 2.6

    • Performance improved linearly when scaling up with additional nodes...
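
How the tile-based structure might work in practice (my sketch, not Li's code): bin each point into a fixed-size tile key derived from its coordinates, so points in the same tile end up in the same HDFS file/partition. Tile size and key format are assumptions.

    # Tile-based partitioning sketch for lidar points (assumed 1 km tiles).
    from collections import defaultdict

    def tile_key(x, y, tile_size=1000.0):
        """Tile key like 'tx_ty' for a point in projected coordinates (meters)."""
        return f"{int(x // tile_size)}_{int(y // tile_size)}"

    # Hypothetical (x, y, z) points in a projected CRS.
    points = [(512345.2, 3761234.9, 104.2),
              (512501.7, 3761300.1, 98.6),
              (531022.4, 3790011.3, 210.5)]

    tiles = defaultdict(list)
    for x, y, z in points:
        tiles[tile_key(x, y)].append((x, y, z))

    for key, pts in tiles.items():
        print(key, len(pts), "points")   # each key would become one tile file on HDFS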


7th Speaker: Milton Halem (University of Maryland)

  • Future accelerated cognitive distributed hybrid testbed for big data science analytics

  • This whole talk is limited to discussing the hardware they just bought sigh

  • IBM iDataPlex cluster: Bluewave

    • CentOS

    • 7 years old.

  • CHMPR

    • IBM 200 petaflop 'IBM Minsky' nodes

      • 10 cores @ 3.0 GHz 

      • 4 NVIDIA P100

        • P100s are 2x faster in 16-bit (half-precision) mode versus 32-bit floats.

      • 2 NVIDIA NVLink

  • D-Wave 2X

    • 1152 qubits

    • D-Wave 3x announced Q1 2017

8th Speaker: David Fulker (OPeNDAP)

  • EarthCube Interoperability Workbench

  • Comic Sans (seriously!)

  • Notebook technology

  • Cited a Nature article up front - brave

  • Shows how its SoS (system of systems) facilitates:

    • interoperability across geoscience

    • reproducibility of geoscience research

    • integrated, user-engaged presence

  • Interactive notebooks in R Markdown or IPython.

    • offer multi-level user interfaces

    • foster human-mediated interoperability

    • foster a collection of workflows-as-notebooks

  • RStudio, Jupyter

    • such notebooks may serve as reproducible, citable artifacts of EC.

    • Open-source textual representations with a DOI. 

Monday AM 1020 - 1200

Big Data Analytics I

Well attended (200+, standing room only)

Chaired by NASA folks

1st Speaker: Dawn Wright (ESRI)

  • Hybrid approach for Earth Science and Real-time decision making

  • OpenSource vs Proprietary

  • Feature Geo Analytics

    • Processing large datastreams in cloud environments

    • Designed to perform both spatial and temporal analytics

    • Work with existing GIS data and table data

  • ESRI is looking to change culture and work with OpenSource community

  • Distributed analytics and storage 

  • AWS cluster available.

  • Python notebooks.

  • Spark, Impala, YARN, HDFS,

  • ArcGIS <--> ArcPy <--> Impyla <--> HDFS (see the query sketch after this list)

  • Python, Java, Scala.

  • www.esri.com/software/open

  • Spatiotemporal Big Data Store (Bigfoot)

  • geonet.esri.com

  • Free to universities using an ESRI site license

  • github.com/mraad.
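
The ArcPy <--> Impyla <--> HDFS path presumably boils down to running SQL against Impala from Python. A minimal impyla sketch is below; the host, table, and column names are placeholders, not ESRI's actual schema.

    # Query Impala from Python with impyla (illustrative endpoint and schema).
    from impala.dbapi import connect

    conn = connect(host="impala-host.example.com", port=21050)   # assumed endpoint
    cur = conn.cursor()

    # Hypothetical point table on HDFS: count events per day per grid cell.
    cur.execute("""
        SELECT to_date(event_time) AS day, grid_cell, COUNT(*) AS n
        FROM gps_events
        GROUP BY to_date(event_time), grid_cell
    """)

    for day, cell, n in cur.fetchall():
        print(day, cell, n)

    cur.close()
    conn.close()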

2nd Speaker: Jed Sundwall (Amazon)

  • Democratizing access to cloud data services

  • AWS makes it accessible to a large and growing community of researchers and entrepreneurs

  • Cleaning up data is often 80%+ of the work.

    • "undifferentiated heavy lifting"

  • S3 --> EC2 --> Client

  • bring algorithms to the data.  

  • 3k Rice Genome dataset

  • LANDSAT 8 - up to a PB on the cloud.

    • USGS --> .tar --> EC2 --> .tif --> S3://landsat-pds

    • Amazon pre-processes the scenes 

    • Make .tif available 

    • GDAL internal tiling on the AWS TIFFs allows HTTP range requests for specific access (see the rasterio sketch after this list). 

      • efficient query

    • Landsat on AWS

      • download time improved to 75 sec per scene from over 400 sec

  • ObservedEarth iPhone app (viz app) - processed 200 TB on a phone.

  • Software that takes advantage of cloud-based raster storage:

    • gdal.org

    • landsatonaws.com

    • mapbox.github.io/rasterio

    • github.com/sat-utils

  • NEXRAD on AWS

    • NEXRAD > EC2 (real-time chunks) > S3 > EC2 > S3

    • 230% increase in data access after moving to AWS.

    • Reduced NCEI servers by 50%

      • latent demand for these data (now trivial to acquire, lots to do!)

  • Sentinel-2

    • aws.amazon.com/public-datasets/sentinal-2

  • Amazon's spatial data

    • aws.amazon.com/earth/research-credits
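
The HTTP-range point above is what makes the internally tiled TIFFs useful: rasterio/GDAL can read one window of a scene straight from S3 without downloading the whole file. A minimal sketch; the scene path is illustrative of the landsat-pds layout rather than a guaranteed object.

    # Windowed read of a Landsat-on-AWS GeoTIFF over HTTP (illustrative path).
    import rasterio
    from rasterio.windows import Window

    url = ("https://landsat-pds.s3.amazonaws.com/"
           "L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B5.TIF")

    with rasterio.open(url) as src:
        # Fetch a 512 x 512 chip from band 5; only those byte ranges are requested.
        chip = src.read(1, window=Window(col_off=2048, row_off=2048,
                                         width=512, height=512))
        print(chip.shape, chip.dtype, float(chip.mean()))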

3rd Speaker: Niall Robinson (UK Met Office Informatics Lab)

  • Browser-based notebooks for climate research

  • 400 TB of data a day, 1 exabyte of data in storage

  • JADE

    • implemented in AWS

    • OpenSource products glued together

    • Jupyter + Docker + DASK

      • DASK - Python-based, works well for them.

      • Bounced away from Hadoop and PySpark.

  • scalability is achieved via AWS

  • 'lazy' jobs - handled by DASK

    • different chunks of data on compute nodes that are brought back together.

    • only pull the data they need to the compute nodes (see the Dask sketch below).

  • S3 is parallel
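
A minimal sketch of the 'lazy' chunked execution Dask gives them: building the expression only builds a task graph, and just the chunks actually touched get computed, on whatever workers the scheduler has (AWS in JADE's case). The array shape and chunking below are made up, not the Met Office's data.

    # Lazy, chunked computation with dask.array (illustrative sizes).
    import dask.array as da

    # A large virtual (time, lat, lon) array split into chunks.
    temps = da.random.random((2_000, 720, 1440), chunks=(100, 360, 720))

    # Cheap: this only builds a task graph, nothing is loaded yet.
    subset_mean = temps[:500].mean(axis=0)

    # Only the chunks touched by the slice are computed, in parallel.
    result = subset_mean.compute()
    print(result.shape)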

 

Headed to Posters.

 

Evening Keynote

 

AGU is focusing on data.
Will continue to increase exposure and focus on data into the future.

Rebecca Moore, Director of Google Earth, EarthEngine, and Outreach.

Initiated and leads outreach on Earth Engine.

Conceived and leads Google's outreach program on conservation and human rights.

EE started in the Amazon - detecting illegal logging.

4M images, 42 years of LANDSAT.
All data was stored on tapes at EROS in South Dakota.

EE Timelapse

Global Forests - first major EE pub.
Hansen et al. 2013
654k scenes, 700 terapixels, 1 million CPU hours, 10,000 CPUs, 4 days.
$392 million in data costs before the USGS/NASA data policy change.

"often it turns out to be more efficient to move the questions than to move the data" -Jim Gray 4th Paradigm.

Bandwidth is expensive, so co-locate CPU and disk.

Disk is cheap, so bring it online.

CPU is expensive, so don't preprocess needlessly

256 px tiled rasters, parallelized and reassembled using the EE API.

EE example - 4 lines of code, cloud-free extent (a rough Python-API equivalent is sketched below).
EE example - nighttime lights trend.
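
A rough Python-API equivalent of the cloud-free composite demo (the talk likely used the JavaScript Code Editor); the collection ID and date range here are my assumptions, not the exact example shown.

    # Cloud-free Landsat 8 composite sketch with the Earth Engine Python API.
    import ee
    ee.Initialize()

    collection = (ee.ImageCollection("LANDSAT/LC08/C01/T1")   # assumed collection ID
                  .filterDate("2016-01-01", "2016-12-31"))

    # simpleComposite picks cloud-free pixels across the year's scenes; the
    # work runs server-side over Google's 256 px tiled rasters.
    composite = ee.Algorithms.Landsat.simpleComposite(collection)
    print(composite.bandNames().getInfo())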

Neural Network Machine Learning w/ Remote Sensing.
EE example - fishing in Asia (20% of fish are illegally caught). Global Fishing Watch. AIS data on most ships. 22M positions daily. "Ending Hide and Seek at Sea", Science, Mar. 2016.

Labelling patterns to detect trawl, longline, purse seine.

Coastal Risk Vanuatu (Dec 14 launch) - future coastal inundation due to sea level rise.

Climate Engine Beta (Desert Research Institute, U of Idaho) - using LANDSAT vs MODIS. Huntington and Abatzoglou. ROSES award from NASA. Use an ensemble ET modelling approach.

Mapping 30 years of Global Surface Water Occurrence and Change. Nature (2016). NYT press last week. Largest analysis on EE.
3 million scenes, 1.8 terapixels, 6 million CPU hours, 10k CPUs, 45 days.

global-surface-water.appspot.com

Federated systems for distributed source data archival in master repositories (1000s of these), e.g. NASA/USGS, NOAA, ESA, etc.

Cloud-based systems which aggregate and host mirror copies of all big data in "analysis-ready" form.
Co-located with massive computational resources for processing (perhaps tens of these worldwide), e.g. Google, Amazon.

Still need incentives for scientists to collaborate, and share scientific data and code.