
Overview - Quantitative Validation Status

...

  • Send Bob a CSV file with the results of the taxon name queries (described below) by today or tomorrow so he can determine whether the names are correctly formed.
    • The two taxon name queries check that Aaron's scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Bob will check the VegBank taxon names.
  • Continue work on the makefile for building a clean install of the BIEN db.
    • Keep us informed of progress -- especially Nick and Mark so they can advise you.
    • This is part of the larger issue of diagnosing and fixing the disk space leak.
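
The concatenation check described above can be sketched as follows. This is a hypothetical illustration, not the actual script: the function name and the skip-missing-parts behavior are assumptions, but it shows the intended rule that the verbatim family and raw taxon name (morphospecies included) are joined before submission to the TNRS.

```python
def concat_for_tnrs(verbatim_family, raw_taxon_name):
    """Concatenate verbatim family + raw taxon name, skipping missing parts.

    Hypothetical sketch of the pre-TNRS concatenation the taxon name
    queries are meant to verify; the real scripts may differ.
    """
    parts = [p.strip() for p in (verbatim_family, raw_taxon_name) if p and p.strip()]
    return " ".join(parts)

# Morphospecies names pass through unchanged:
assert concat_for_tnrs("Fabaceae", "Inga sp. 3") == "Fabaceae Inga sp. 3"
# A missing family leaves only the raw taxon name:
assert concat_for_tnrs(None, "Quercus alba") == "Quercus alba"
```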

...

  • By end of the day on Friday:
    • Create a query to check that Aaron's scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Do this for VegBank taxon names. Rather than sending the output results to Bob for review, Aaron was waiting for Mike to write corresponding queries against VegBank for comparison purposes.
    • Run the plot output queries on the VegBank data in VegBIEN. Let Mike know where the results tables are and give him access to them.

...

  1. fix input queries #4,5: remove subspecies IS NOT NULL filter
  2. fix input queries #6,7: add subspecies IS NOT NULL filter
  3. fix output queries #6,7: use subspecies instead of the concatenated taxonomic name
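
The effect of moving the `subspecies IS NOT NULL` filter between the query pairs can be illustrated with a toy table. This is not one of the actual BIEN queries; the table and data are made up, but the contrast between the unfiltered and filtered counts is the point of fixes 1 and 2 above.

```python
import sqlite3

# Toy in-memory table standing in for a staging table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxon (name TEXT, subspecies TEXT)")
conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                 [("Quercus alba", None), ("Poa pratensis", "angustifolia")])

# Like input queries #4,5 after the fix: no subspecies filter, all rows count.
all_rows = conn.execute("SELECT count(*) FROM taxon").fetchone()[0]
# Like input queries #6,7 after the fix: only rows with a subspecies count.
subsp_rows = conn.execute(
    "SELECT count(*) FROM taxon WHERE subspecies IS NOT NULL").fetchone()[0]

assert (all_rows, subsp_rows) == (2, 1)
```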

2014.04.03

Specimen output queries are written.

To Do (Aaron)

aggregating validations

  1. specimens queries
    1. implement workaround for the slowdown in query #12
    2. run pipeline on NY to generate diffs
  2. plots queries
    1. write denormalized plots input queries, using VegBank as the example datasource
    2. finish fixing plots output queries
  3. validate datasources
    1. SALVIAS
    2. denormalized plots datasources: VegBank, CVS, CTFS
    3. specimens
    4. FIA (special case, with separate input queries)
    5. normalized plots datasources: TEAM, Madidi
      1. denormalize
      2. validate
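
The "generate diffs" step in the to-do list above amounts to comparing an input query's result rows against the matching output query's rows. A minimal sketch, assuming CSV result sets (the file contents and column names here are hypothetical):

```python
import csv
from io import StringIO

def diff_results(input_csv, output_csv):
    """Return rows present in one query's result set but not the other.

    Sketch of the aggregating-validation diff step; the real pipeline's
    file handling and row ordering may differ.
    """
    a = set(tuple(r) for r in csv.reader(StringIO(input_csv)))
    b = set(tuple(r) for r in csv.reader(StringIO(output_csv)))
    return sorted(a - b), sorted(b - a)

# A count mismatch shows up as one row on each side of the diff:
missing, extra = diff_results("family,count\nFabaceae,10\n",
                              "family,count\nFabaceae,9\n")
assert missing == [("Fabaceae", "10")]
assert extra == [("Fabaceae", "9")]
```

An empty diff on both sides would mean the input and output queries agree for that datasource.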

2014.03.27

Specimen input queries are written.

Decisions

plots aggregating validations
  • won't denormalize SALVIAS because we already have input queries for it (Brad)
  • validate FIA last because it's a special case (Brad)
specimens aggregating validations
  • OK to run NY validations when writing specimens output queries instead of at the end with the other specimens datasources (Brad)
  • when writing specimens output queries based on NY input queries, treat query name as authoritative rather than query implementation (Brad)
  • use taxonoccurrence as the main specimen table
  • use concatenated taxon name instead of concatenating the ranks, since not all specimens datasources provide the ranks
new-style import
  • needs to include the denormalization of normalized datasources

...

NY

...

  • use artificial key as pkey instead of removing rows that are missing an accessionNumber (Brad)
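
The decision above keeps NY rows that lack an accessionNumber by giving them an artificial key instead of dropping them. A minimal sketch of that idea, with a hypothetical key prefix and column name:

```python
def assign_pkeys(rows):
    """Use accessionNumber as pkey when present, else a synthetic row key.

    Illustrative sketch of the artificial-pkey decision; the real loading
    scripts may generate surrogate keys differently.
    """
    keyed = {}
    for i, row in enumerate(rows):
        pkey = row.get("accessionNumber") or f"row-{i}"  # synthetic fallback
        keyed[pkey] = row
    return keyed

rows = [{"accessionNumber": "NY-123"}, {"accessionNumber": None}]
keyed = assign_pkeys(rows)
# Both rows survive: one under its real key, one under a synthetic key.
assert set(keyed) == {"NY-123", "row-1"}
```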

To Do for Aaron

aggregating validations
  1. finish specimens output queries
    • use concatenated taxon name instead of concatenating the ranks
    • in #1, use taxonoccurrence instead of location as the main specimen table
  2. run specimens output queries on NY to test them
  3. denormalize normalized plots datasources: TEAM, Madidi
  4. write denormalized plots input queries
  5. finish fixing plots output queries
  6. validate plots datasources: SALVIAS, VegBank, CVS, TEAM, Madidi, CTFS, FIA
  7. validate specimens datasources
new-style import

...

NY

...

...

FIA

...

2014.03.18

All plot output queries are written.

To Do for Aaron

  1. Write the specimen output queries
  2. Complete all plot input queries, including modifying Brad's FIA input queries
  3. Write the specimen input queries
  4. Validate all plot datasets
  5. Validate all specimen datasets

...

aggregating validations

  1. implement new query #19
  2. write specimens output queries

mappings changes

...

  • Brad will re-number the FIA queries so they correspond with the other 18 queries.
  • Brad will validate FIA as well as SALVIAS.

2014.02.27

Trait validations completed.

decisions

SALVIAS aggregating validations

...

  1. test each query outside of the pipeline, in order
  2. send e-mail that queries written
  3. implement each query in the pipeline

traits data (to be done by end of March)

  1. add TraitObservation unmapped columns to the trait table
  2. reload staging tables from bien2_staging.TraitObservation once Brad has updated it to the new traits data

...

  • Priorities for Aaron
    1. Get the pipeline finished and working with all the sources.
    2. Document problems uncovered by the quantitative validation scripts.
    3. Don't begin the fixes until the entire pipeline is in place.
    4. Due to time limitations, some triaging of problems revealed by the quantitative validation process will need to be done.

Problem 

  • It's unclear who will be available to work with Aaron to help him fix any problems that are uncovered by the validation scripts.

...

Aaron

aggregating validations

  1. get pipeline working
  2. fix datasource-selection bug on output side
  3. traits aggregating validations
  4. SALVIAS aggregating validations

Martha

  • Look into getting the iPlant assistance for Brad on TNRS.

2014.02.13

Decisions

  • Work to make validation results accessible to the data provider as CSV files should be done LATER, not now, and is part of Brad's UI work, not Aaron's database development work.
  • The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and would know whether they were correct. Report the column header and count.
    • This would help (in part) address our concerns about validating against the staging tables instead of the original data source.
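
The count-reporting idea above can be sketched in a few lines. The report format and the query names/counts below are made up for illustration; the point is that each line pairs a column header with its count so a data provider can eyeball the values.

```python
def format_count_report(counts):
    """Render '<header>: <count>' lines for a validation logfile.

    Hedged sketch of the proposed logfile format, not the pipeline's
    actual output.
    """
    return "\n".join(f"{header}: {count}" for header, count in counts)

report = format_count_report([("# of taxa", 1047), ("# of plots", 312)])
assert report == "# of taxa: 1047\n# of plots: 312"
```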

...

  • Aaron: (1) For the quantitative validation pipeline, begin with traits:
    • Edit Brad's input queries for traits to use vegCore names.
    • Complete Brad's taxon name related queries (5,6,8).
      • Send the queries to Brad on Friday.
    • Aaron write corresponding output queries. 
      • Send them to Brad so he can learn and help.
    • Create pipeline.
    • Generate report of any failures.
    • (2) After the pipeline is working for Traits, move on to SALVIAS.
  • Brad: After Aaron has the pipeline working, create a simple diagram documenting the quantitative pipeline. 
    • Include links to scripts and the diff files.
  • Brad and Aaron: Check in with each other on Monday.
  • Brad: Talk with Bob and Mike separately regarding Mike's availability to write VegBank and CVS quantitative validation queries against the original db schemas.
  • Brad: Complete the FIA input queries.

Cultivated specimens

  • Brad: Schedule a call to discuss cultivated specimens and handle it outside of the BIEN database calls. Met Feb. 3. Notes here.

Carnegie Data

  • Brian: Request a data sample.
    • Let them know BIEN (or VegBank) would be ready to ingest it around April.

...

  • Martha: Schedule for Th or F of next week.  Scheduled for Feb. 6th.

2014.01.23

Decisions 

  • For specimens, validate against the VegCORE staging tables.
  • For plots, validate against the unmapped original database column names.
    • Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.

...

  • Write the blank quantitative validations queries and fix the queries that have comments in CAPS.
    • Review the taxon related queries since Brad is less confident about them.
    • Let Brad and Mike know when the queries are ready since they'll need to refer to them to write the plot validation queries.
  • Make any necessary easy schema changes for the queries (above).
    • Inform the group if any difficult schema changes arise.
  • Create the quantitative validation pipeline. 
    • Do all queries on the Postgres databases in house, so we don't have to match up between MySQL and Postgres. Will need to translate the SQL queries to Postgres.
  • For plots, unmap the staging table names.
    • Let Brad and Mike know when tables/dbs with original plot column names are ready.
  • For specimens, write the queries against the VegCORE staging tables for each specimen source.
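
The MySQL-to-Postgres translation mentioned above is largely mechanical for simple queries. A deliberately minimal sketch that handles only two common differences (backtick identifier quoting and `IFNULL`); real validation queries would need more than this:

```python
import re

def mysqlish_to_postgres(sql):
    """Translate two common MySQL-isms to Postgres equivalents.

    Illustrative and incomplete: covers identifier quoting and IFNULL
    only, as an example of the kind of rewriting the pipeline needs.
    """
    sql = re.sub(r"`([^`]*)`", r'"\1"', sql)                  # `x` -> "x"
    sql = re.sub(r"\bIFNULL\s*\(", "COALESCE(", sql, flags=re.I)
    return sql

assert (mysqlish_to_postgres("SELECT IFNULL(`family`, '') FROM `taxon`")
        == 'SELECT COALESCE("family", \'\') FROM "taxon"')
```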

...

  • Brad will write the queries against (most) unmapped original databases.
  • Mike will need to do so for VegBank and CVS. Mike may not have time.

2014.01.16

Decisions and Clarifications

Decision: Take validation of BIEN2 traits off Aaron's plate. Brad will do it.

Clarification: For now, Aaron should not work on the other issues found from spot checking other data sources. That is a lower priority than the Quantitative Validation work.

To Do

Plot (and Project) Data Providers

Aaron: For SALVIAS, complete the work as described in Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18.

BIEN2 Traits

Brad: Validate the BIEN2 trait data, taking this task off Aaron's plate. Use the VegBIEN normalized trait table and the original input data from BIEN2.

  • Spot check the data.
  • Write quantitative validation queries.
  • Send queries to Aaron so he can put them into the validation pipeline.

Aaron: After Brad sends you his BIEN2 Trait quantitative validation queries, put them into the validation pipeline.

Quantitative Validations

Brad: Send Aaron the queries on the original SALVIAS database, which he forgot to attach previously.

Aaron: Work on Items 3.1 and 3.2 described in "Plot data providers" in Brad's "High priority tasks" email message of Dec. 17/18.

  • After the queries (12,13,15) are fixed, send them back to Brad so he understands where his mistakes were.

2014.01.09

To Do

Aaron

  • Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18.
  • GBIF
    • Taxon names (in Jan. for beta)
      • Code a workaround for the accepted names that are missing family names.
      • Then, re-run the names that are missing family names.
  • BIEN2 Traits (in Jan. for beta). Reassigned to Brad 01.16. Aaron will left join "the view that left joins all the tables" to the additional trait information for Brad to validate.
  • 3. Quantitative validations: Aaron, don't worry about this item yet. After the call (see Martha's Dec. 19 meeting notes), we decided that item 3, Quantitative validations, from the "High priority tasks" email should NOT be considered part of the beta milestone. We decided to do it AFTER beta.

...

  • VegBank, CVS and SALVIAS (in Dec.)
    • Highest priority fixes for Aaron to complete first are items #1 and #2 specified in Brad's email subject "High priority tasks" date Dec. 17.
  • GBIF
    • Taxon names (in Jan. for beta)
      • Code a workaround for the accepted names that are missing family names.
      • Then, re-run the names that are missing family names.
  • BIEN2 Traits (in Jan. for beta)
    • Aaron will left join "the view that left joins all the tables" to the additional trait information for Brad to validate. Reassigned to Brad 01.16

Aaron (later, after beta)
These will be prioritized later.

...

We would like you to complete these tasks in the order listed. Please try to complete tasks 1 and 2 this week, and get started on task 3 in early January when you return to work.

1. Projects completed 01.09

Correct mapping and loading scripts for VegBank and CVS to ensure that projects can be loaded correctly to the core database. No core schema changes should be required. Do whatever is necessary to load this information into the core database: either (a) reload the entire dataset, or (b) write a custom single-use script to add the missing projects and link them to the relevant plots and sources.

2. Plot data providers completed for VegBank 01.16, SALVIAS 01.23; CVS?

Make all changes necessary to mappings, loading scripts and core schema to ensure that primary data providers for plots can be loaded to the core database and correctly linked to the plots for which they are responsible. Do whatever is necessary to actually load this information into the core database: either (a) reload the entire dataset, or (b) write a custom single-use script to add the missing data providers to the core database and link them to the relevant plots, projects and sources. I have added 2 new tables to salvias_plots on nimoy which contain the missing information on SALVIAS data providers (party_code_party and party). Please use these tables when revising the SALVIAS loading scripts. Bob and Mike will answer questions you might have regarding VegBank and CVS data providers.

...