To Do - BIEN db

Overview - Quantitative Validation Status

Task                                      Status                                        Completed
Specimens
  input queries                           written, troubleshooting NY                   3/27
  output queries                          written, troubleshooting NY                   4/3
  pipeline                                runs
  sources validated
Plots
  input queries
  output queries                          written                                       3/18
  pipeline
  sources validated
    FIA, SALVIAS by Brad                  waiting on fixes from Aaron and data reload
    VegBank by Mike (as proxy for plots)  in progress
Traits
  data validated                          complete                                      2/27

2014.05.08

To Do

Aaron

  • Send Bob a CSV file with the results of the taxon name queries (described below) by today or tomorrow so he can determine whether the names are correctly formed.

    • The two taxon name queries check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Bob will check the VegBank taxon names.
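
The concatenation step being checked can be sketched as a small helper; the function name and the handling of missing parts are assumptions for illustration, not Aaron's actual implementation:

```python
def build_tnrs_input(verbatim_family, raw_taxon_name):
    """Join the verbatim family and the raw taxon name (morphospecies
    included verbatim) into the single string sent to the TNRS."""
    # Skip missing parts so a NULL family doesn't produce stray whitespace.
    parts = [p.strip() for p in (verbatim_family, raw_taxon_name)
             if p and p.strip()]
    return " ".join(parts)

print(build_tnrs_input("Fabaceae", "Inga sp. 1"))  # Fabaceae Inga sp. 1
print(build_tnrs_input(None, "Inga edulis"))       # Inga edulis
```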

  • Continue work on the make file for building a clean install of the BIEN db.

    • Keep us informed of progress, especially Nick and Mark, so they can advise you.

    • This is part of the larger issue of diagnosing and fixing the disk space leak.

Bob

  • Review how the taxon names are being formed. (within a week)

Mike

  • Continue with the VegBank validations.

  • Let us know if you need something or are waiting for something.

  • Please correspond using the bien-db@iplantcollaborative.org mailing list so everyone is kept in the loop.

Martha

  • Find out Brian's availability through the summer. 

  • Find out which datasets Brian needs the taxon name results from so he can help validate that taxon names are correctly formed.

2014.05.01

To Do

Aaron

  • By end of the day on Friday:

    • Create a query to check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Do this for the VegBank taxon names. Instead of sending the output results to Bob to review, Aaron is waiting for Mike to write corresponding queries against VegBank for comparison purposes.

    • Run the plot output queries on the VegBank data in VegBIEN. Let Mike know where the results tables are and give him access to them.

Bob

  • Review how the taxon names are being formed. (within a week)

Mike

  • Compare the results of the queries Aaron runs to queries you run directly against VegBank. (will take a while; keep us informed as you progress)

2014.04.17

To Do for Aaron

  • Top priority: Aaron to contact Nick this morning about the disk space leak. Aaron should work with Nick to resolve the problem since it is a system issue. Copy Mark. Let us know as soon as that is resolved.

  • Specimens Query 15: 

    • Schema change required to add locationRemarks.

    • Then reload the data to get it in that field.

    • Fix the queries.

    • Run the diff.

  • Specimens Query 13 (has problems to troubleshoot)

  • Aaron needs to fix the input filter to just filter out the projects Brad listed at the top of the MySQL queries. He'll then need to reimport the data before the SALVIAS queries can be expected to generate correct output.

  • FIA: Aaron will follow Brad's suggestion to bypass Aaron's pipeline for filtering when loading FIA data into VegBIEN. 

  • FIA: Brad will write scripts (that run after FIA data are loaded into VegBIEN) to filter and build indexes on the analytical tables.
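
Brad's post-load filter-and-index scripts could take roughly this shape. The table and column names below are illustrative stand-ins, not the real VegBIEN analytical schema, and the sketch runs against an in-memory SQLite database for concreteness:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- illustrative stand-in for a VegBIEN analytical table
CREATE TABLE analytical_stem (datasource TEXT, taxon TEXT, plot TEXT);
INSERT INTO analytical_stem VALUES
    ('FIA', 'Pinus ponderosa', 'p1'),
    ('FIA', NULL,              'p2');

-- filter after loading, instead of inside the import pipeline
DELETE FROM analytical_stem WHERE taxon IS NULL;

-- build an index so the validation queries stay fast
CREATE INDEX analytical_stem_taxon_idx
    ON analytical_stem (datasource, taxon);
""")
remaining = con.execute("SELECT COUNT(*) FROM analytical_stem").fetchone()[0]
print(remaining)  # 1
```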

2014.04.10

To Do for Aaron

VegBIEN schema

  • add taxonverbatim.subspecies

  • add TNRS and geoscrub tables to ERD

specimens aggregating validations

  1. fix input queries #4,5: remove subspecies IS NOT NULL filter

  2. fix input queries #6,7: add subspecies IS NOT NULL filter

  3. fix output queries #6,7: use subspecies instead of the concatenated taxonomic name
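
The intent of fixes 2 and 3 can be illustrated with a toy table (the table and column names are assumptions): the #6,7 queries should both keep the `subspecies IS NOT NULL` filter and report the subspecies epithet itself rather than the concatenated name.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxonverbatim (taxon_name TEXT, subspecies TEXT)")
con.executemany("INSERT INTO taxonverbatim VALUES (?, ?)", [
    ("Inga edulis",      None),    # no infraspecific epithet
    ("Inga edulis vera", "vera"),  # subspecies present
])

# Queries #6,7: restrict to infraspecific records and select the
# subspecies column, not the concatenated taxonomic name.
rows = con.execute(
    "SELECT subspecies FROM taxonverbatim WHERE subspecies IS NOT NULL"
).fetchall()
print(rows)  # [('vera',)]
```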

2014.04.03

Specimen output queries are written.

To Do (Aaron)

aggregating validations

  1. specimens queries

    1. implement workaround for the slowdown in query #12

    2. run pipeline on NY to generate diffs

  2. plots queries

    1. write denormalized plots input queries, using VegBank as the example datasource

    2. finish fixing plots output queries

  3. validate datasources

    1. SALVIAS

    2. denormalized plots datasources: VegBank, CVS, CTFS

    3. specimens

    4. FIA (special case, with separate input queries)

    5. normalized plots datasources: TEAM, Madidi

      1. denormalize

      2. validate

2014.03.27

Specimen input queries are written.

Decisions

plots aggregating validations
  • won't denormalize SALVIAS because already have input queries for it (Brad)

  • validate FIA last because it's a special case (Brad)

specimens aggregating validations
  • OK to run NY validations when writing specimens output queries instead of at the end with the other specimens datasources (Brad)

  • when writing specimens output queries based on NY input queries, treat query name as authoritative rather than query implementation (Brad)

  • use taxonoccurrence as the main specimen table

  • use concatenated taxon name instead of concatenating the ranks, since not all specimens datasources provide the ranks

new-style import
  • needs to include the denormalization of normalized datasources

NY
  • use artificial key as pkey instead of removing rows that are missing an accessionNumber (Brad)

To Do for Aaron

aggregating validations
  1. finish specimens output queries

    • use concatenated taxon name instead of concatenating the ranks

    • in #1, use taxonoccurrence instead of location as the main specimen table

  2. run specimens output queries on NY to test them

  3. denormalize normalized plots datasources: TEAM, Madidi

  4. write denormalized plots input queries

  5. finish fixing plots output queries

  6. validate plots datasources: SALVIAS, VegBank, CVS, TEAM, Madidi, CTFS, FIA

  7. validate specimens datasources

new-style import
  • NY
  • FIA

2014.03.18

All plot output queries are written.

To Do for Aaron

  1. Write the specimen output queries

  2. Complete all plot input queries, including modifying Brad's FIA input queries

  3. Write the specimen input queries

  4. Validate all plot datasets

  5. Validate all specimen datasets

2014.03.13

Decisions

VegBIEN schema

  • for attribution purposes, every location should be linked to a dataset, which is the first publisher/primary data provider

  • to simplify the plots validation queries, locations in a plots datasource should also be linked to a project (the project that defines the plot)

    • this requires a single pseudo-project for plots datasources that don't use projects, which is not ideal (Mark)

aggregating validations

  • don't address issues (diffs) until working on input queries (Brad)

To Do

for Brad

schema changes

  • write provenance use cases

for Aaron

schema changes

  1. write up changes needed

  2. rename VegCore specimenHolderInstitutions to specimen_duplicate_institutions

  3. rename specimenreplicate.institution_id to duplicate_institutions_id

  4. add dataset table

  5. link project to dataset (for all our current datasources, effectively 1:1)

  6. link location to dataset (the first publisher/primary data provider)

  7. link location to project (the project that defines the plot)
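
Steps 3-7 amount to DDL along these lines, sketched here against SQLite with minimal stand-in tables (the real VegBIEN tables have far more columns; PostgreSQL accepts the same `RENAME COLUMN` form):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- step 4: new dataset table
CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT);

-- step 5: project -> dataset (1:1 for all current datasources)
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    dataset_id INTEGER REFERENCES dataset (dataset_id)
);

-- steps 6-7: location -> dataset (first publisher) and -> project
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    dataset_id  INTEGER REFERENCES dataset (dataset_id),
    project_id  INTEGER REFERENCES project (project_id)
);

-- step 3: rename the duplicate-institutions column
CREATE TABLE specimenreplicate (institution_id INTEGER);
ALTER TABLE specimenreplicate
    RENAME COLUMN institution_id TO duplicate_institutions_id;
""")
cols = [row[1] for row in con.execute("PRAGMA table_info(specimenreplicate)")]
print(cols)  # ['duplicate_institutions_id']
```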

aggregating validations

  1. implement new query #19

  2. write specimens output queries

mappings changes

  1. populate location.dataset for all datasources

    1. for aggregators, create one for each first publisher/primary data provider

    2. for plots with projects, create one for each project

    3. for herbarium specimens and plots without projects, create one for the datasource itself

  2. make location.dataset required

  3. populate project.dataset for all plots datasources

    1. for all our current datasources, project is 1:1 with dataset

  4. make project.dataset required
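
The three mapping rules in step 1 can be expressed as a small decision function; the function name and the dict shape are hypothetical, chosen only to make the rules concrete:

```python
def dataset_for(location):
    """Pick the dataset a location record should link to, following the
    three rules above: aggregators use the first publisher / primary data
    provider, plots with projects use the project, and everything else
    (herbarium specimens, plots without projects) uses the datasource."""
    if location["source_type"] == "aggregator":
        return location["first_publisher"]
    if location["source_type"] == "plots" and location.get("project"):
        return location["project"]
    return location["datasource"]

print(dataset_for({"source_type": "aggregator",
                   "first_publisher": "NYBG",
                   "datasource": "GBIF"}))  # NYBG
```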

2014.03.06

To Do (Aaron)

  1. Finish all plot output queries. Make sure they work and are stable. This is a single set of 18 queries modeled after the SALVIAS queries Brad provided.

  2. Write specimens output queries. Make sure they work and are stable. This is a single set of 16 queries modeled after the NYBG queries Brad provided.

  3. Write plots input queries against the data in the denormalized VegCore schema. Make sure they work and are stable. This is a single set of 18 queries that match the plot output queries.

  4. Write specimens input queries against the specimen data in the vegCore schema. Make sure they work and are stable. This is a single set of 16 queries that match the specimen output queries. 

  5. Run plot validations for the two plot datasets that have already been denormalized (VegBank, CVS), excluding FIA.

  6. Run all specimen validations.

  7. Denormalize the remaining plot datasources TEAM, Madidi, and CTFS so they will work with the denormalized plot input queries.

  8. Complete plot validations on TEAM, Madidi, and CTFS.

  9. As you work on completing the queries, please send Brad each day the most recent version of the queries you are working on (plot output, specimen output, plot input or specimen input) so he can gauge progress and help with any problems you might be encountering.

For Brad

  • Brad will re-number the FIA queries so they correspond with the other 18 queries.

  • Brad will validate FIA as well as SALVIAS.

2014.02.27

Trait validations completed.

decisions

SALVIAS aggregating validations

  • implement and test queries in the order they are numbered, instead of writing the blank queries first (Brad)

    • write blank queries if they are the next query in the sequence

  • test queries outside of the pipeline first

traits data (to be done by end of March)

  • add TraitObservation unmapped columns to the trait table rather than mapping them to party (Brad)

  • OK to just add new columns to the version-controlled schema, and let them get added to the live DB in the next full-database import

  • Brad will load new data to bien2_staging.TraitObservation, rather than us importing it directly into VegBIEN from the CSVs, in the interest of time (Brad)

to do for Aaron

schema changes

  1. rename location.top_plot -> plot_location_id

    • in the database, the triggers, and the FAQ

  2. send e-mail that top_plot renamed

  3. rename top_plot view -> plot

SALVIAS aggregating validations

  1. test each query outside of the pipeline, in order

  2. send e-mail that queries written

  3. implement each query in the pipeline

traits data (to be done by end of March)

  1. add TraitObservation unmapped columns to the trait table

  2. reload staging tables from bien2_staging.TraitObservation once Brad has updated it to the new traits data

TNRS client

  • add new source that Brad is adding

2014.02.20

Decisions

  • Priorities for Aaron

    1. Get the pipeline finished and working with all the sources.

    2. Document problems uncovered by the quantitative validation scripts.

    3. Don’t begin the fixes until the entire pipeline is in place.

    4. Due to time limitations, some triaging of problems revealed by the quantitative validation process will need to be done.

Problem 

  • It's unclear who will be available to work with Aaron to help him fix any problems that are uncovered by the validation scripts.

To Do

Aaron

aggregating validations

  1. get pipeline working

  2. fix datasource-selection bug on output side

  3. traits aggregating validations

  4. SALVIAS aggregating validations

Martha

  • Look into getting the iPlant assistance for Brad on TNRS.

2014.02.13

Decisions

  • Work to make validation results accessible to the data provider as CSV files should be done LATER, not now, and is part of Brad's UI work, not Aaron's database development work.

  • The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and know whether they were correct. Report the column header and count.

    • This would help (in part) address our concerns about validating against the staging tables instead of the original data source.

To Do

Aaron

Diagram

  • Add details to the quantitative validation workflow diagram as discussed, including links to examples. (DONE)

Development

  • Complete implementation of the input pipeline and table join (diff)

  • Send results from Traits to the group.

  • Then move on to SALVIAS and other data sources (if time permits).

2014.02.06

Decisions

  • Aaron will use SQL to compare the records instead of a diff of text files.

  • If one query takes more than 10 minutes, add a record limit (5000 or 10,000), but keep the number of records as large as possible.

  • In the interest of time, Brad will keep the cultivated specimen data sources in a separate database outside of BIEN.
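
The first decision above (compare records in SQL rather than diffing text files) could be realized as a two-way EXCEPT with a record limit. This is a sketch on SQLite with toy tables, not the actual VegBIEN comparison, which runs between the input- and output-query results:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE input_side  (taxon TEXT, n INTEGER);
CREATE TABLE output_side (taxon TEXT, n INTEGER);
INSERT INTO input_side  VALUES ('Inga edulis', 10), ('Ocotea', 3);
INSERT INTO output_side VALUES ('Inga edulis', 10), ('Ocotea', 4);
""")

# Rows present on one side but not the other, in either direction;
# the LIMIT keeps a runaway comparison bounded, per the decision above.
diff = con.execute("""
    SELECT 'input only' AS side, * FROM
        (SELECT * FROM input_side EXCEPT SELECT * FROM output_side)
    UNION ALL
    SELECT 'output only', * FROM
        (SELECT * FROM output_side EXCEPT SELECT * FROM input_side)
    LIMIT 10000
""").fetchall()
print(diff)  # the mismatched Ocotea counts, one row per side
```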

To Do

Aaron

  • Make a high-level diagram to document the validation pipeline. (DONE)

  • Continue work on the Traits validation pipeline; run it and let us know how the validations came out.

  • For traits, remove the empty rows (pre-delete them) that resulted from records that lacked a taxon name.

    • Then adjust the queries.

  • Reminder to contact Brad promptly with any questions or if problems arise. Don't wait for the Thursday call.

Brad

  • Continue work on cultivated specimens and taxonomy.

Brad and Aaron

  • Reminder to send emails to the entire group so everyone is in the loop and can help.

2014.01.30

Decisions

  • (Not a decision, but on the table) Due to Mike's time constraints, we may have to consider doing the VegBank and CVS quantitative validations against the VegCore schema instead of against the original databases, since so much careful spot-checking of these sources has already been done.

To Do

Quantitative Validations

  • Aaron: (1) For the quantitative validation pipeline, begin with traits:

    • Edit Brad’s input queries for traits to use vegCore names.

    • Complete Brad’s taxon name related queries (5,6,8). 

      • Send the queries to Brad on Friday.

    • Write corresponding output queries. 

      • Send them to Brad so he can learn and help.

    • Create pipeline.

    • Generate report of any failures.

    • (2) After the pipeline is working for Traits, move on to SALVIAS.

  • Brad: After Aaron has the pipeline working, create a simple diagram documenting the quantitative pipeline. 

    • Include links to scripts and the diff files.

  • Brad and Aaron: Check in with each other on Monday.

  • Brad: Talk with Bob and Mike separately regarding Mike's availability to write VegBank and CVS quantitative validation queries against the original db schemas.

  • Brad: Complete the FIA input queries.

Cultivated specimens

  • Brad: Schedule a call to discuss cultivated specimens and handle it outside of the BIEN database calls. Met Feb. 3. Notes here.

Carnegie Data

  • Brian: Request a data sample.