AGU_Fall2016
Monday AM 0800-1000
Architecture and integration testbed for Cyberinfrastructure
Initial attendance: low (40-50 people).
1st Speaker: Sara Graves (U Alabama) - EarthCube ECITE (Earth Cube Integration and Test Environment).
Long text slides
Infrastructure for test and evaluation
Easily accessible computational resources.
Promote collaboration and participation.
Need for a testbed environment (2014)
EarthCube 'Workbench'
sounds like a DE
Amazon, OpenStack, Eucalyptus
sites.cloud.gmu.edu/ecite
ecite.dc2.gmu.edu/dc2us2/login.jsp
Attendance picking up (75-100).
2nd Speaker: Scott Farley (U Wisconsin).
Climate and big data
Environmental covariates for Species Distribution Modelling (SDM).
Neotoma = 1.5 million records since 2010
Random Forests (RF), Boosted Regression Trees, Generalized Additive Models; presence/absence models.
Bayesian Additive Regression Trees
Cloud computing using Amazon Web Services.
Choosing optimal hardware for an SDM.
NSF EAR 150707
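A minimal sketch of what a presence/absence SDM fit like this could look like with a Random Forest in scikit-learn; the input file and covariate column names are made-up placeholders, not Farley's actual data.
```python
# Toy presence/absence SDM with a Random Forest (scikit-learn).
# The CSV file and covariate names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

occ = pd.read_csv("occurrences_with_covariates.csv")              # hypothetical input
covariates = ["mean_annual_temp", "annual_precip", "elevation"]   # assumed climate covariates
X, y = occ[covariates], occ["presence"]                           # 1 = presence, 0 = absence

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
print("AUC:", cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())
```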
3rd Speaker Peishi Jiang, UIUC (Illinois)
hcgs.ncsa.illinois.edu
Heterogeneity of data and model results, and difficulties in seamless data-model integration.
SEAD, CSDMS, Clowder, ESMF, OpenMI
Develop a framework rooted in semantic web technologies.
Knowledge integration service: geosemantics framework.
EMELI-Web http://hcgs.ncsa.illinois.edu/emli-web
4th Speaker: Daniel Duffy (NASA Goddard Space Flight Center)
NASA Center for Climate Simulation (NCCS)
GEOS, GMAO
27 km resolution, 27 million grid points
12 km
Exabyte-size datasets when we move to 100-200 m resolution.
HPC Science cloud
Data Analytics Storage Service (DASS) (~20PB)
CAaaS (Climate Analytics-as-a-Service)
MERRA Analytic Services doi: 10.1016/j.compenvurbsys.2013
Create a true centralized combination of storage and compute capacity.
A new 20 PB system costs ~$1 million.
Hadoop ecosystem.
Apollo 4520: 28 servers, 896 cores, 14 TB memory, 16 GB/core, 37 TF compute.
MapReduce, Spark, Machine Learning, RESTful, Cloudera and SIA, General Parallel File System (GPFS) and Hadoop Connector.
CentOS
Software RAID
Linux Storage
SIA = spatiotemporal indexing approach, used with Hadoop.
Zhenlong Li 2016, "A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce".
Native NetCDF data.
Lustre did not perform as well as Hadoop/Cloudera on the MapReduce query; results were otherwise similar.
Serial performance w/ SIA: GPFS with MapReduce is fastest (5x speed improvement over Hadoop) on native NetCDF.
Parallel Performance w/ SIA: similar results, 2x improvement.
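A rough illustration of the SIA idea (not the paper's implementation): build an index from (variable, timestep) to the file and local time index holding it, so each worker/map task reads only the slice it needs from the native NetCDF files. The file layout and variable name below are assumptions.
```python
# Sketch of a spatiotemporal index over native NetCDF files:
# (variable, timestamp) -> (file, local time index), so workers read only what they need.
# File layout ("merra_*.nc4") and variable name ("T2M") are assumptions.
import glob
from netCDF4 import Dataset, num2date

index = {}  # (var, timestamp) -> (path, time_index)
for path in sorted(glob.glob("merra_*.nc4")):
    with Dataset(path) as nc:
        times = num2date(nc["time"][:], nc["time"].units)
        for i, t in enumerate(times):
            index[("T2M", t)] = (path, i)

def read_slice(var, timestamp):
    """A 'map' task: read one 2-D grid for one timestep from the indexed file."""
    path, i = index[(var, timestamp)]
    with Dataset(path) as nc:
        return nc[var][i, :, :]
```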
5th Speaker: Charuleka Varadharajan (Lawrence Berkeley National Lab)
Management and assimilation of diverse distributed watershed datasets.
How do mountain watersheds release water, nutrients, carbon and metals?
Agile data approach for user-centred data.
Three databases for project:
ifrcrifle.org
easteriver.dev.subsurfaceinsights.com
ggkbase.berkeley.edu
Basin-3D, Python, Django, Bitbucket.
watershed.lbl.gov
6th Speaker: Zhenlong Li (U South Carolina)
Large-scale parallel processing of lidar data
Domain based vs Tile-based
Standard lidar slide - big data, billions of points, ArcGIS sucks.
Distributed storage and computing cluster (Hadoop)
maximize load balancing, parallelization and data locality
Tile-based structure for all of SC.
HDFS distributes tiles across nodes by spatial index (geographic-space reference); see the sketch below.
Seems like a job for cctools
gis.cas.sc.edu/lidar
Hadoop cluster v. 2.6
Performance improved linearly when scaling up with additional nodes.
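A minimal sketch of the tile-based partitioning idea with PySpark: key each lidar point by its tile so all points for a tile end up together on a node. The input format (x y z per line) and the tile size are assumptions, not the speaker's actual code.
```python
# Sketch of tile-based partitioning of lidar points with PySpark.
# Input format (x y z per line) and the 1000 m tile size are assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="lidar-tiles")
TILE = 1000.0  # tile edge length in metres (assumed)

def to_tile(line):
    x, y, z = map(float, line.split()[:3])
    return ((int(x // TILE), int(y // TILE)), (x, y, z))  # tile key -> point

points = sc.textFile("hdfs:///lidar/points.txt").map(to_tile)
tiles = points.groupByKey()                                  # co-locate all points of a tile
counts = tiles.mapValues(lambda pts: len(list(pts))).collect()  # e.g. per-tile point counts
```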
7th Speaker: Milton Halem (University of Maryland)
Future accelerated cognitive distributed hybrid testbed for big data science analytics
This whole talk is limited to discussing the hardware they just bought sigh
IBM iDataPlex cluster: Bluewave
CentOS
7 years old.
CHMPR
IBM 200 petaflop 'IBM Minsky' nodes
10 cores, 3.0 GHz
4 NVIDIA P100 GPUs
P100s are 2x faster in 16-bit (half precision) mode than in 32-bit single precision.
2 NVIDIA NVLink
D-Wave 2x
1152 qubits
D-Wave 3x announced Q1 2017
8th Speaker: David Fulker (OPeNDAP)
EarthCube Interoperability Workbench
Comic Sans (seriously!)
Notebook technology
Cited Nature article up front -brave
Show how its SoS (system of systems) facilitates:
interoperability across geoscience
reproducibility of geoscience research
integrated, user-engaged presence
Interactive notebooks in R Markdown or IPython.
offer multi-level user interfaces
foster human-mediated interoperability
foster a collection of workflows-as-notebooks
RStudio, Jupyter
Such notebooks may serve as reproducible, citable artifacts of EC.
Open-source textual representations with a DOI.
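A sketch of the kind of notebook cell this enables: pulling a remote dataset over OPeNDAP straight into Python and subsetting it lazily. The server URL and variable name are placeholders, not an actual EarthCube endpoint.
```python
# Notebook-style cell: open a remote dataset over OPeNDAP and plot one field.
# The server URL and variable name are hypothetical placeholders.
import xarray as xr

url = "http://example.org/thredds/dodsC/some_dataset"  # OPeNDAP endpoint (placeholder)
ds = xr.open_dataset(url)          # lazy: only metadata is fetched here
ds["sst"].isel(time=0).plot()      # subsetting pulls just the needed slice
```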
Monday AM 1020 - 1200
Big Data Analytics I
Well attended (200+, standing room only)
Chaired by NASA folks
1st Speaker: Dawn Wright (ESRI)
Hybrid approach for Earth Science and Real-time decision making
OpenSource vs Proprietary
Feature Geo Analytics
Processing large datastreams in cloud environments
Designed to perform both spatial and temporal analytics
Work with existing GIS data and table data
ESRI is looking to change culture and work with OpenSource community
Distributed analytics and storage
AWS cluster available.
Python notebooks.
Spark, Impala, YARN, HDFS,
ArcGIS <--> ArcPy <--> Impyla <--> HDFS (sketch below)
Python, Java, Scala.
Spatiotemporal Big Data Store (Bigfoot)
geonet.esri.com
Free to universities using an ESRI site license.
github.com/mraad.
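A minimal sketch of the Impyla <--> HDFS link in that chain: query an Impala table over Hadoop from Python, with the rows then available to push into a GIS layer via ArcPy. The host, table, and column names are assumptions.
```python
# Sketch: query Impala (over Hadoop/HDFS) from Python with impyla.
# Host, table, and columns are assumptions; results could feed ArcPy features.
from impala.dbapi import connect

conn = connect(host="impala-host.example.org", port=21050)
cur = conn.cursor()
cur.execute("SELECT lon, lat, value FROM sensor_readings WHERE value > 100")
rows = cur.fetchall()   # [(lon, lat, value), ...] ready to load as GIS features
```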
2nd Speaker: Jed Sundwall (Amazon)
Democratizing access to cloud data services
AWS makes it accessible to a large and growing community of researchers and entrepreneurs.
Cleaning up data is often 80%+ of the work.
"undifferentiated heavy lifting"
S3 --> EC2 --> Client
bring algorithms to the data.
3K Rice Genome.
Landsat 8 - up to a PB on the cloud.
USGS --> .tar --> EC2 --> .tif --> S3://landsat-pds
Amazon pre-processes the scenes
Make .tif available
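Roughly what the per-scene step could look like: rewrite a band as an internally tiled GeoTIFF, then push it to the public bucket. The paths, scene ID, and options below are assumptions for illustration, not Amazon's actual pipeline.
```python
# Sketch of the per-scene preprocessing step (not Amazon's actual code):
# rewrite a band as an internally tiled GeoTIFF, then upload it to S3.
import boto3
from osgeo import gdal

gdal.Translate(
    "LC80010012016001LGN00_B4_tiled.TIF",
    "LC80010012016001LGN00_B4.TIF",          # hypothetical scene/band
    creationOptions=["TILED=YES", "COMPRESS=DEFLATE"],
)
boto3.client("s3").upload_file(
    "LC80010012016001LGN00_B4_tiled.TIF",
    "landsat-pds",                            # the public bucket named above
    "L8/001/001/LC80010012016001LGN00/LC80010012016001LGN00_B4.TIF",  # assumed key layout
)
```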
GDAL internal tiling on the AWS GeoTIFFs allows HTTP range requests to access specific portions (see the rasterio sketch below).
efficient query
Landsat on AWS
Download time per scene dropped from over 400 sec to 75 sec.
ObservedEarth iPhone app (viz app) - processed 200 TB on a phone.
Software that takes advantage of cloud-based raster storage:
gdal.org
landsatonaws.com
mapbox.github.io/rasterio
github.com/sat-utils
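A minimal sketch of why the internal tiling matters: rasterio can read just a window of a scene over HTTP instead of downloading the whole file. The scene URL below is a placeholder in the landsat-pds style.
```python
# Read only a small window of a Landsat scene over HTTP (no full download).
# The scene URL is a placeholder; internal tiling makes the ranged reads cheap.
import rasterio
from rasterio.windows import Window

url = "https://landsat-pds.s3.amazonaws.com/L8/001/001/LC80010012016001LGN00/LC80010012016001LGN00_B4.TIF"
with rasterio.open(url) as src:
    chip = src.read(1, window=Window(0, 0, 512, 512))  # 512x512 pixels from band 4
print(chip.shape, chip.dtype)
```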
NEXRAD on AWS
NEXRAD --> EC2 (real-time chunks) --> S3 --> EC2 --> S3
230% increase in data access after moving to AWS.
Reduced NCEI servers by 50%
latent demand for these data (now trivial to acquire, lots to do!)
Sentinel-2
aws.amazon.com/public-datasets/sentinel-2
Amazon's spatial data
aws.amazon.com/earth/research-credits
3rd Speaker: Niall Robinson (UK Met Office Informatics Lab)
Browser-based notebooks for climate research
400 TB of data a day; 1 exabyte of data in storage
JADE
implemented in AWS
OpenSource products glued together
Jupyter + Docker + Dask
Dask - Python based, works well for them.
Bounced away from Hadoop and PySpark.
scalability is achieved via AWS
'lazy' jobs - handled by Dask
different chunks of data on compute nodes that are brought back together.
only pull the data they need to the compute nodes.
S3 is parallel
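A minimal sketch of the lazy, chunked pattern described: nothing runs until .compute(), and each chunk can be processed on a different worker before the results are combined. The array sizes are arbitrary.
```python
# Lazy, chunked computation with Dask: the task graph is built cheaply,
# each chunk can run on a different worker, and results are combined at the end.
import dask.array as da

temps = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))  # arbitrary sizes
col_mean = temps.mean(axis=0)   # still lazy: no data touched yet
result = col_mean.compute()     # triggers the parallel/distributed execution
print(result.shape)
```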
Headed to Posters.
Evening Keynote
AGU is focusing on data.
Will continue to increase exposure and focus on data into the future.
Rebecca Moore, Director of Google Earth, EarthEngine, and Outreach.
Initiated and leads outreach on Engine.
Conceived and leads Google's outreach program on conservation and human rights.
EE started in the Amazon - detecting illegal logging.
4M images, 42 years LANDSAT
All data was stored on tapes at EROS in South Dakota.
EE Timelapse
Global Forests - first major EE pub.
Hansen et al. 2013
654k scenes, 700 terapixels, 1 million CPU hours, 10,000 CPUs, 4 days.
$392 million data cost before the USGS/NASA data policy change.
"often it turns out to be more efficient to move the questions than to move the data" -Jim Gray 4th Paradigm.
Bandwidth is expensive, so co-locate CPU and disk.
Disk is cheap, bring it online.
CPU is expensive, so don't preprocess needlessly
256-pixel tiled rasters, parallelized and reassembled using the EE API.
EE example - 4 lines of code, cloud free extent.
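Roughly what a few-line cloud-free composite looks like in the EE Python API; the collection ID and date range are assumptions about the example shown, not the actual demo code.
```python
# Sketch of a cloud-free Landsat 8 composite in the Earth Engine Python API.
# Collection ID and date range are assumptions about the example shown.
import ee
ee.Initialize()

collection = ee.ImageCollection("LANDSAT/LC8_L1T_TOA").filterDate("2016-01-01", "2016-12-31")
composite = ee.Algorithms.Landsat.simpleComposite(collection)  # per-pixel cloud-free composite
```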
EE example - nighttime lights trend.
Neural Network Machine Learning w/ Remote Sensing.
EE example - fishing in Asia (20% of fish are illegally caught). Global Fishing Watch. AIS data on most ships. 22m positions daily. "Ending Hide and Seek at Sea" Science Mar. 2016.
Labelling patterns to detect trawl, longline, purse seine.
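Presumably the gear-type labelling reduces to supervised classification on features derived from AIS tracks; a toy sketch with a small neural network (the feature names and input file are made up for illustration).
```python
# Toy sketch of gear-type classification from AIS-derived track features
# with a small neural network. Feature names and the input file are made up.
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

tracks = pd.read_csv("ais_track_features.csv")             # hypothetical labelled tracks
X = tracks[["mean_speed", "turning_rate", "dist_from_shore"]]
y = tracks["gear_type"]                                     # trawl / longline / purse seine

clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())
```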
Coastal Risk Vanuatu (Dec 14 launch) - future coastal inundation due to sea level rise.
Climate Engine Beta (Desert Research Institute, U of Idaho) - using LANDSAT vs MODIS. Huntington and Abatzoglou. ROSES award from NASA. Use an ensemble ET modelling approach.
Mapping 30 years of Global Surface Water Occurrence and Change. Nature (2016). NYT press last week. Largest analysis on EE.
3 million scenes, 1.8 terapixels, 6 million CPU hours, 10k CPUs, 45 days.
global-surface-water.appspot.com
Federated systems for distributed source data archival in master repositories (1000s of these), e.g. NASA/USGS, NOAA, ESA, etc.
Cloud-based systems which aggregate and host mirror-copy of all big data in "analysis-ready" form.
Co-located with massive computational resources for processing (perhaps tens of these worldwide), e.g. Google, Amazon.
Still need incentives for scientists to collaborate, and share scientific data and code.