To Do - BIEN db
Overview - Quantitative Validation Status
| Task | Status | Completed |
|---|---|---|
| Specimens | | |
| input queries | written, troubleshooting NY | 3/27 |
| output queries | written, troubleshooting NY | 4/3 |
| pipeline | runs | |
| sources validated | | |
| Plots | | |
| input queries | | |
| output queries | written | 3/18 |
| pipeline | | |
| sources validated | | |
| FIA, SALVIAS by Brad | waiting on fixes from Aaron and data reload | |
| VegBank by Mike (as proxy for plots) | in progress | |
| Traits | | |
| data validated | complete | 2/27 |
2014.05.08
To Do
Aaron
Send Bob a CSV file with the results of the taxon name queries (described below) by today or tomorrow so he can determine whether or not the names are correctly formed.
The two taxon name queries check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Bob will check the VegBank taxon names.
Continue work on the make file for building a clean install of the BIEN db.
Keep us informed of progress -- especially Nick and Mark so they can advise you.
This is part of the larger issue of diagnosing and fixing the disk space leak.
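The family + raw name concatenation check described above could be sketched roughly as follows; SQLite stands in for the live database, and the table and column names (staging_specimen, verbatim_family, raw_taxon_name) are invented for illustration, not VegBIEN's actual schema.

```python
import sqlite3

# Hypothetical staging table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staging_specimen (
        verbatim_family TEXT,
        raw_taxon_name  TEXT   -- includes any morphospecies suffix
    )
""")
conn.executemany(
    "INSERT INTO staging_specimen VALUES (?, ?)",
    [("Fabaceae", "Inga edulis"), ("Asteraceae", "Baccharis sp. 1")],
)

# Check 1: the name sent to the TNRS should be "<verbatim family> <raw name>".
rows = conn.execute("""
    SELECT verbatim_family || ' ' || raw_taxon_name AS tnrs_input
    FROM staging_specimen
    ORDER BY tnrs_input
""").fetchall()
for (tnrs_input,) in rows:
    print(tnrs_input)
```

Exporting such a result set as CSV (for Bob to eyeball) is then a one-liner with Python's csv module or psql's \copy.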
Bob
Review how the taxon names are being formed. (within a week)
Mike
Continue with the VegBank validations.
Let us know if you need something or are waiting for something.
Please correspond using the bien-db@iplantcollaborative.org mailing list so everyone is kept in the loop.
Martha
Find out Brian's availability through the summer.
Find out which datasets Brian needs the taxon name results from so he can help validate that taxon names are correctly formed.
2014.05.01
To Do
Aaron
By end of the day on Friday:
Create a query to check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Do this for VegBank taxon names. Instead of sending the output results to Bob to review, Aaron was waiting for Mike to write corresponding queries against VegBank for comparison.
Run the plot output queries on the VegBank data in VegBIEN. Let Mike know where the results tables are and give him access to them.
Bob
Review how the taxon names are being formed. (within a week)
Mike
Compare the results of the queries Aaron runs to queries you run directly against VegBank. (will take a while; keep us informed as you progress)
2014.04.17
To Do for Aaron
Top priority: Aaron to contact Nick this morning about the disk space leak. Aaron should work with Nick to resolve the problem since it is a system issue. Copy Mark. Let us know as soon as that is resolved.
Specimens Query 15:
Schema change required to add locationRemarks.
Then reload the data to get it in that field.
Fix the queries.
Run the diff.
Specimens Query 13 (has problems to troubleshoot)
Aaron needs to fix the input filter to just filter out the projects Brad listed at the top of the MySQL queries. He'll then need to reimport the data before the SALVIAS queries can be expected to generate correct output.
FIA: Aaron will follow Brad's suggestion to bypass Aaron's pipeline for filtering when loading FIA data into VegBIEN.
FIA: Brad will write scripts (that run after FIA data are loaded into VegBIEN) to filter and build indexes on the analytical tables.
2014.04.10
To Do for Aaron
add taxonverbatim.subspecies
add TNRS and geoscrub tables to ERD
specimens aggregating validations
fix input queries #4,5: remove subspecies IS NOT NULL filter
fix input queries #6,7: add subspecies IS NOT NULL filter
fix output queries #6,7: use subspecies instead of the concatenated taxonomic name
2014.04.03
Specimen output queries are written.
To Do (Aaron)
implement workaround for the slowdown in query #12
run pipeline on NY to generate diffs
write denormalized plots input queries, using VegBank as the example datasource
finish fixing plots output queries
SALVIAS
denormalized plots datasources: VegBank, CVS, CTFS
specimens
FIA (special case, with separate input queries)
normalized plots datasources: TEAM, Madidi
denormalize
validate
2014.03.27
Specimen input queries are written.
Decisions
plots aggregating validations
won't denormalize SALVIAS because already have input queries for it (Brad)
validate FIA last because it's a special case (Brad)
specimens aggregating validations
OK to run NY validations when writing specimens output queries instead of at the end with the other specimens datasources (Brad)
when writing specimens output queries based on NY input queries, treat query name as authoritative rather than query implementation (Brad)
use taxonoccurrence as the main specimen table
use concatenated taxon name instead of concatenating the ranks, since not all specimens datasources provide the ranks
new-style import
needs to include the denormalization of normalized datasources
NY
use artificial key as pkey instead of removing rows that are missing an accessionNumber (Brad)
To Do for Aaron
aggregating validations
finish specimens output queries
use concatenated taxon name instead of concatenating the ranks
in #1, use taxonoccurrence instead of location as the main specimen table
run specimens output queries on NY to test them
denormalize normalized plots datasources: TEAM, Madidi
write denormalized plots input queries
finish fixing plots output queries
validate plots datasources: SALVIAS, VegBank, CVS, TEAM, Madidi, CTFS, FIA
validate specimens datasources
new-style import
denormalize normalized plots datasources: TEAM, Madidi, later SALVIAS
NY
use artificial key as pkey instead of accessionNumber
FIA
remap what's mapped to locationName to a suffix of it instead (locationName itself corresponds to plotCode below)
map INVYR to a suffix of authorEventCode (authorEventCode itself corresponds to plotCensusCode below)
add plotCode, plotCensusCode derived columns:
plotCode = CONCAT_WS("_",STATECD,COUNTYCD,PLOT)
plotCensusCode = CONCAT_WS("_",STATECD,COUNTYCD,PLOT,INVYR)
plotCode, plotCensusCode mapped to locationName, authorEventCode
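The derived-column logic above, sketched in plain Python. The column names (STATECD, COUNTYCD, PLOT, INVYR) come from the FIA mapping above; the row values and the concat_ws helper are illustrative (CONCAT_WS is the SQL function being mimicked, which skips NULL values).

```python
def concat_ws(sep, *vals):
    """Mimic SQL CONCAT_WS: join non-NULL values with a separator."""
    return sep.join(str(v) for v in vals if v is not None)

# Invented FIA row, just to show the derivation.
row = {"STATECD": 6, "COUNTYCD": 37, "PLOT": 12345, "INVYR": 2003}

plot_code = concat_ws("_", row["STATECD"], row["COUNTYCD"], row["PLOT"])
plot_census_code = concat_ws(
    "_", row["STATECD"], row["COUNTYCD"], row["PLOT"], row["INVYR"])

print(plot_code)         # 6_37_12345
print(plot_census_code)  # 6_37_12345_2003
```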
2014.03.18
All plot output queries are written.
To Do for Aaron
1. Write the specimen output queries
2. Complete all plot input queries, including modifying Brad's FIA input queries
3. Write the specimen input queries
4. Validate all plot datasets
5. Validate all specimen datasets
2014.03.13
Decisions
for attribution purposes, every location should be linked to a dataset, which is the first publisher/primary data provider
to simplify the plots validation queries, locations in a plots datasource should also be linked to a project (the project that defines the plot)
this requires a single pseudo-project for plots datasources that don't use projects, which is not ideal (Mark)
don't address issues (diffs) until working on input queries (Brad)
To Do
for Brad
schema changes
write provenance use cases
for Aaron
schema changes
write up changes needed
rename VegCore specimenHolderInstitutions to specimen_duplicate_institutions
rename specimenreplicate.institution_id to duplicate_institutions_id
add dataset table
link project to dataset (for all our current datasources, effectively 1:1)
link location to dataset (the first publisher/primary data provider)
link location to project (the project that defines the plot)
implement new query #19
write specimens output queries
mappings changes
populate location.dataset for all datasources
for aggregators, create one for each first publisher/primary data provider
for plots with projects, create one for each project
for herbarium specimens and plots without projects, create one for the datasource itself
make location.dataset required
populate project.dataset for all plots datasources
for all our current datasources, project is 1:1 with dataset
make project.dataset required
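The dataset/project/location linkage described above could look roughly like this; SQLite DDL stands in for the live database, and the exact column names (dataset_id, project_id) and key types are assumptions, not VegBIEN's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New dataset table: one row per first publisher / primary data provider.
    CREATE TABLE dataset (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- project links to dataset (effectively 1:1 for current datasources).
    CREATE TABLE project (
        id         INTEGER PRIMARY KEY,
        dataset_id INTEGER NOT NULL REFERENCES dataset(id),
        name       TEXT
    );
    -- Every location links to a dataset; plots locations also to a project
    -- (the project that defines the plot).
    CREATE TABLE location (
        id         INTEGER PRIMARY KEY,
        dataset_id INTEGER NOT NULL REFERENCES dataset(id),
        project_id INTEGER REFERENCES project(id)
    );
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print("tables:", tables)
```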
2014.03.06
To Do (Aaron)
Finish all plot output queries. Make sure they work and are stable. This is a single set of 18 queries modeled after the SALVIAS queries Brad provided.
Write specimens output queries. Make sure they work and are stable. This is a single set of 16 queries modeled after the NYBG queries Brad provided.
Write plots input queries against the data in the denormalized VegCore schema. Make sure they work and are stable. This is a single set of 18 queries that match the plot output queries.
Write specimens input queries against the specimen data in the vegCore schema. Make sure they work and are stable. This is a single set of 16 queries that match the specimen output queries.
Run plot validations for the two plot datasets that have already been denormalized (VegBank and CVS), excluding FIA.
Run all specimen validations.
Denormalize the remaining plot datasources TEAM, Madidi, and CTFS so they will work with the denormalized plot input queries.
Complete plot validations on TEAM, Madidi, and CTFS.
As you work on completing the queries, please send Brad the most recent version of whichever set you are working on (plot output, specimen output, plot input, or specimen input) each day, so he can gauge progress and help with any problems you encounter.
For Brad
Brad will re-number the FIA queries so they correspond with the other 18 queries.
Brad will validate FIA as well as SALVIAS.
2014.02.27
Trait validations completed.
decisions
SALVIAS aggregating validations
implement and test queries in the order they are numbered, instead of writing the blank queries first (Brad)
write blank queries if they are the next query in the sequence
test queries outside of the pipeline first
traits data (to be done by end of March)
add TraitObservation unmapped columns to the trait table rather than mapping them to party (Brad)
OK to just add new columns to the version-controlled schema, and let them get added to the live DB in the next full-database import
Brad will load new data to bien2_staging.TraitObservation, rather than us importing it directly into VegBIEN from the CSVs, in the interest of time (Brad)
to do for Aaron
schema changes
rename location.top_plot -> plot_location_id
in the database, the triggers, and the FAQ
send e-mail that top_plot renamed
rename top_plot view -> plot
SALVIAS aggregating validations
test each query outside of the pipeline, in order
send e-mail that queries written
implement each query in the pipeline
traits data (to be done by end of March)
add TraitObservation unmapped columns to the trait table
reload staging tables from bien2_staging.TraitObservation once Brad has updated it to the new traits data
TNRS client
add new source that Brad is adding
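The location.top_plot rename in the schema changes above could be sketched as follows; SQLite stands in for the live database (where the ALTER TABLE syntax is the same), and the views' WHERE clauses are invented since the real view definition isn't given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Starting state: old column and view names. The view body is hypothetical.
conn.executescript("""
    CREATE TABLE location (id INTEGER PRIMARY KEY, top_plot INTEGER);
    CREATE VIEW top_plot AS SELECT * FROM location WHERE top_plot IS NULL;
""")
# Rename column and view (drop/recreate the view first so the rename
# doesn't invalidate it; requires SQLite >= 3.25 for RENAME COLUMN).
conn.executescript("""
    DROP VIEW top_plot;
    ALTER TABLE location RENAME COLUMN top_plot TO plot_location_id;
    CREATE VIEW plot AS SELECT * FROM location WHERE plot_location_id IS NULL;
""")
cols = [r[1] for r in conn.execute("PRAGMA table_info(location)")]
print(cols)  # ['id', 'plot_location_id']
```

Any triggers and FAQ text referencing top_plot would need the same rename, as the to-do above notes.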
2014.02.20
Decisions
Priorities for Aaron
1) Get the pipeline finished and working with all the sources.
2) Document problems uncovered by the quantitative validation scripts.
3) Don’t begin the fixes until the entire pipeline is in place.
4) Due to time limitations, some triaging of problems revealed by the quantitative validation process will need to be done.
Problem
It's unclear who will be available to work with Aaron to help him fix any problems that are uncovered by the validation scripts.
To Do
Aaron
aggregating validations
get pipeline working
fix datasource-selection bug on output side
Martha
Look into getting the iPlant assistance for Brad on TNRS.
2014.02.13
Decisions
Work to make validation results accessible to the data provider as CSV files should be done LATER, not now, and is part of Brad's UI work, not Aaron's database development work.
The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and would know if they were correct or not. Report the column header and count.
This would help (in part) address our concerns about validating against the staging tables instead of the original data source.
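The logfile idea above, sketched over a toy trait table; the count queries and their headers are invented for illustration, not the real validation queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trait (taxon_name TEXT, value REAL)")
conn.executemany("INSERT INTO trait VALUES (?, ?)",
                 [("Inga edulis", 1.2), ("Inga edulis", 3.4), (None, 5.6)])

# Hypothetical count queries, keyed by the header to report.
count_queries = {
    "total_records":        "SELECT COUNT(*) FROM trait",
    "records_with_taxon":   "SELECT COUNT(*) FROM trait WHERE taxon_name IS NOT NULL",
    "distinct_taxon_names": "SELECT COUNT(DISTINCT taxon_name) FROM trait",
}

# Report column header and count, as decided, so the data provider
# can eyeball whether the numbers look right.
report_lines = [f"{header}: {conn.execute(sql).fetchone()[0]}"
                for header, sql in count_queries.items()]
print("\n".join(report_lines))
```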
To Do
Aaron
Diagram
Add details to the quantitative validation workflow diagram as discussed, including links to examples. (DONE)
Development
Complete implementation of the input pipeline and table join (diff)
Send results from Traits to the group.
Then move on to SALVIAS and other data sources (if time permits).
2014.02.06
Decisions
Aaron will use SQL to compare the records instead of a diff of text files.
If one query takes more than 10 minutes, add a record limit (5,000 or 10,000), but keep the number of records as large as possible.
In the interest of time, Brad will keep the cultivated specimen data sources in a separate database outside of BIEN.
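The SQL-based record comparison decided above could be sketched with EXCEPT plus a record limit; the table names (input_result, output_result) and rows are invented, and SQLite stands in for the live database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE input_result  (taxon TEXT, n INTEGER);
    CREATE TABLE output_result (taxon TEXT, n INTEGER);
    INSERT INTO input_result  VALUES ('Inga edulis', 10), ('Baccharis sp. 1', 4);
    INSERT INTO output_result VALUES ('Inga edulis', 10), ('Baccharis sp. 1', 3);
""")

# Rows present on one side but not the other, capped at a record limit
# so a single comparison cannot run unboundedly long.
RECORD_LIMIT = 5000
diff_sql = """
    SELECT 'input only' AS side, * FROM
        (SELECT * FROM input_result EXCEPT SELECT * FROM output_result)
    UNION ALL
    SELECT 'output only', * FROM
        (SELECT * FROM output_result EXCEPT SELECT * FROM input_result)
    LIMIT ?
"""
diff = conn.execute(diff_sql, (RECORD_LIMIT,)).fetchall()
for row in diff:
    print(row)
```

An empty diff means the input and output queries agree; any rows returned are the discrepancies to troubleshoot.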
To Do
Aaron
Make a high-level diagram to document the validation pipeline. (DONE)
Continue work on the Traits validation pipeline; run it and let us know how the validations came out.
For traits, remove the empty rows (pre-delete them) that resulted from records that lacked a taxon name.
Then adjust the queries.
Reminder to contact Brad promptly with any questions or if problems arise. Don't wait for the Thursday call.
Brad
Continue work on cultivated specimens and taxonomy.
Brad and Aaron
Reminder to send emails to the entire group so everyone is in the loop and can help.
2014.01.30
Decisions
(Not a decision, but on the table) Due to Mike's time constraints, we may have to consider doing the VegBank and CVS quantitative validations against the VegCore schema instead of against the original databases, since so much careful spot-checking of these sources has already been done.
To Do
Quantitative Validations
Aaron: (1) For the quantitative validation pipeline, begin with traits:
Edit Brad’s input queries for traits to use vegCore names.
Complete Brad’s taxon name related queries (5,6,8).
Send the queries to Brad on Friday.
Write corresponding output queries.
Send them to Brad so he can learn and help.
Create pipeline.
Generate report of any failures.
(2) After the pipeline is working for Traits, move on to SALVIAS.
Brad: After Aaron has the pipeline working, create a simple diagram documenting the quantitative pipeline.
Include links to scripts and the diff files.
Brad and Aaron: Check in with each other on Monday.
Brad: Talk with Bob and Mike separately regarding Mike's availability to write VegBank and CVS quantitative validation queries against the original db schemas.
Brad: Complete the FIA input queries.
Cultivated specimens
Brad: Schedule a call to discuss cultivated specimens and handle it outside of the BIEN database calls. Met Feb. 3. Notes here.
Carnegie Data
Brian: Request a data sample.