To Do - BIEN db
Overview - Quantitative Validation Status
Task | Status | Completed
---|---|---
Specimens | |
input queries | written, troubleshooting NY | 3/27
output queries | written, troubleshooting NY | 4/3
pipeline | runs |
sources validated | |
Plots | |
input queries | |
output queries | written | 3/18
pipeline | |
sources validated | |
FIA, SALVIAS by Brad | waiting on fixes from Aaron and data reload |
VegBank by Mike (as proxy for plots) | in progress |
Traits | |
data validated | complete | 2/27
2014.05.08
To Do
Aaron
- Send Bob a CSV file with the results of the taxon name queries (described below) by today or tomorrow so he can determine whether or not the names are correctly formed.
- The two taxon name queries check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Bob will check the VegBank taxon names.
- Continue work on the make file for building a clean install of the BIEN db.
- Keep us informed of progress -- especially Nick and Mark so they can advise you.
- This is part of the larger issue of diagnosing and fixing the disk space leak.
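The concatenation being checked in the first task could be sketched as follows; the function name, column names, and the simple space-join rule are assumptions for illustration, not the actual scripts:

```python
# Hypothetical sketch of the pre-TNRS concatenation under review:
# join the verbatim family and the raw taxon name (morphospecies kept
# verbatim) before submission. Column names and the space-join rule
# are illustrative assumptions.

def tnrs_input_name(verbatim_family, raw_taxon_name):
    """Build the name string submitted to the TNRS."""
    parts = [p.strip() for p in (verbatim_family, raw_taxon_name)
             if p and p.strip()]
    return " ".join(parts)

print(tnrs_input_name("Fabaceae", "Inga sp.1"))  # Fabaceae Inga sp.1
print(tnrs_input_name(None, "Quercus alba"))     # Quercus alba
```

Comparing these built strings (and the resolved names returned) against Bob's expectations is what the CSV export is for.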
Bob
- Review how the taxon names are being formed. (within a week)
Mike
- Continue with the VegBank validations.
- Let us know if you need something or are waiting for something.
- Please correspond using the bien-db@iplantcollaborative.org mailing list so everyone is kept in the loop.
Martha
- Find out Brian's availability through the summer.
- Find out which datasets Brian needs the taxon name results from so he can help validate that taxon names are correctly formed.
2014.05.01
To Do
Aaron
- By end of the day on Friday:
- Create a query to check that Aaron’s scripts 1) correctly concatenate the verbatim family + raw taxon name (including morphospecies) *prior* to sending to the TNRS and 2) correctly display the resolved name returned by the TNRS. Do this for VegBank taxon names. (Instead of sending the output results to Bob for review, Aaron is waiting for Mike to write corresponding queries against VegBank for comparison purposes.)
- Run the plot output queries on the VegBank data in VegBIEN. Let Mike know where the results tables are and give him access to them.
Bob
- Review how the taxon names are being formed. (within a week)
Mike
- Compare the results of the queries Aaron runs to queries you run directly against VegBank. (will take a while; keep us informed as you progress)
2014.04.17
To Do for Aaron
- Top priority: Aaron to contact Nick this morning about the disk space leak. Aaron should work with Nick to resolve the problem since it is a system issue. Copy Mark. Let us know as soon as that is resolved.
- Specimens Query 15:
- Schema change required to add locationRemarks.
- Then reload the data to get it in that field.
- Fix the queries.
- Run the diff.
- Specimens Query 13 (has problems to troubleshoot)
- Aaron needs to fix the input filter to just filter out the projects Brad listed at the top of the MySQL queries. He'll then need to reimport the data before the SALVIAS queries can be expected to generate correct output.
- FIA: Aaron will follow Brad's suggestion to bypass Aaron's pipeline for filtering when loading FIA data into VegBIEN.
- FIA: Brad will write scripts (that run after FIA data are loaded into VegBIEN) to filter and build indexes on the analytical tables.
2014.04.10
To Do for Aaron
- add taxonverbatim.subspecies
- add TNRS and geoscrub tables to ERD
specimens aggregating validations
- fix input queries #4,5: remove subspecies IS NOT NULL filter
- fix input queries #6,7: add subspecies IS NOT NULL filter
- fix output queries #6,7: use subspecies instead of the concatenated taxonomic name
2014.04.03
Specimen output queries are written.
To Do (Aaron)
- specimens queries
- implement workaround for the slowdown in query #12
- run pipeline on NY to generate diffs
- plots queries
- write denormalized plots input queries, using VegBank as the example datasource
- finish fixing plots output queries
- validate datasources
- SALVIAS
- denormalized plots datasources: VegBank, CVS, CTFS
- specimens
- FIA (special case, with separate input queries)
- normalized plots datasources: TEAM, Madidi
- denormalize
- validate
2014.03.27
Specimen input queries are written.
Decisions
plots aggregating validations
- won't denormalize SALVIAS because already have input queries for it (Brad)
- validate FIA last because it's a special case (Brad)
specimens aggregating validations
- OK to run NY validations when writing specimens output queries instead of at the end with the other specimens datasources (Brad)
- when writing specimens output queries based on NY input queries, treat query name as authoritative rather than query implementation (Brad)
- use taxonoccurrence as the main specimen table
- use concatenated taxon name instead of concatenating the ranks, since not all specimens datasources provide the ranks
new-style import
- needs to include the denormalization of normalized datasources
NY
- use artificial key as pkey instead of removing rows that are missing an accessionNumber (Brad)
To Do for Aaron
aggregating validations
- finish specimens output queries
- use concatenated taxon name instead of concatenating the ranks
- in #1, use taxonoccurrence instead of location as the main specimen table
- run specimens output queries on NY to test them
- denormalize normalized plots datasources: TEAM, Madidi
- write denormalized plots input queries
- finish fixing plots output queries
- validate plots datasources: SALVIAS, VegBank, CVS, TEAM, Madidi, CTFS, FIA
- validate specimens datasources
new-style import
- denormalize normalized plots datasources: TEAM, Madidi, later SALVIAS
NY
- use artificial key as pkey instead of accessionNumber
FIA
- remap what's mapped to locationName to a suffix of it instead (locationName itself corresponds to plotCode below)
- map INVYR to a suffix of authorEventCode (authorEventCode itself corresponds to plotCensusCode below)
- add plotCode, plotCensusCode derived columns:
plotCode = CONCAT_WS("_",STATECD,COUNTYCD,PLOT)
plotCensusCode = CONCAT_WS("_",STATECD,COUNTYCD,PLOT,INVYR)
- mapped to locationName, authorEventCode
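The derived columns above rely on MySQL's CONCAT_WS, which joins its arguments with the separator and skips NULLs. A minimal Python sketch of the same logic (the FIA field values here are made up):

```python
def concat_ws(sep, *vals):
    """MySQL-style CONCAT_WS: join values with sep, skipping NULLs."""
    return sep.join(str(v) for v in vals if v is not None)

# Made-up FIA field values for illustration only
row = {"STATECD": 6, "COUNTYCD": 17, "PLOT": 305, "INVYR": 2004}

plot_code = concat_ws("_", row["STATECD"], row["COUNTYCD"], row["PLOT"])
plot_census_code = concat_ws("_", row["STATECD"], row["COUNTYCD"],
                             row["PLOT"], row["INVYR"])
print(plot_code)         # 6_17_305
print(plot_census_code)  # 6_17_305_2004
```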
2014.03.18
All plot output queries are written.
To Do for Aaron
- 1. Write the specimen output queries
- 2. Complete all plot input queries, including modifying Brad's FIA input queries
- 3. Write the specimen input queries
- 4. Validate all plot datasets
- 5. Validate all specimen datasets
2014.03.13
Decisions
- for attribution purposes, every location should be linked to a dataset, which is the first publisher/primary data provider
- to simplify the plots validation queries, locations in a plots datasource should also be linked to a project (the project that defines the plot)
- this requires a single pseudo-project for plots datasources that don't use projects, which is not ideal (Mark)
- don't address issues (diffs) until working on input queries (Brad)
To Do
for Brad
schema changes
- write provenance use cases
for Aaron
schema changes
- write up changes needed
- rename VegCore specimenHolderInstitutions to specimen_duplicate_institutions
- rename specimenreplicate.institution_id to duplicate_institutions_id
- add dataset table
- link project to dataset (for all our current datasources, effectively 1:1)
- link location to dataset (the first publisher/primary data provider)
- link location to project (the project that defines the plot)
- implement new query #19
- write specimens output queries
mappings changes
- populate location.dataset for all datasources
- for aggregators, create one for each first publisher/primary data provider
- for plots with projects, create one for each project
- for herbarium specimens and plots without projects, create one for the datasource itself
- make location.dataset required
- populate project.dataset for all plots datasources
- for all our current datasources, project is 1:1 with dataset
- make project.dataset required
2014.03.06
To Do (Aaron)
- Finish all plot output queries. Make sure they work and are stable. This is a single set of 18 queries modeled after the SALVIAS queries Brad provided.
- Write specimens output queries. Make sure they work and are stable. This is a single set of 16 queries modeled after the NYBG queries Brad provided.
- Write plots input queries against the data in the denormalized VegCore schema. Make sure they work and are stable. This is a single set of 18 queries that match the plot output queries.
- Write specimens input queries against the specimen data in the vegCore schema. Make sure they work and are stable. This is a single set of 16 queries that match the specimen output queries.
- Run plot validations for two plot data sets that have already been denormalized, excluding FIA (VegBank, CVS).
- Run all specimen validations.
- Denormalize the remaining plot datasources TEAM, Madidi, and CTFS so they will work with the denormalized plot input queries.
- Complete plot validations on TEAM, Madidi, and CTFS.
- As you work on completing the queries, please send Brad each day the most recent version of the queries you are working on (plot output, specimen output, plot input or specimen input) so he can gauge progress and help with any problems you might be encountering.
For Brad
- Brad will re-number the FIA queries so they correspond with the other 18 queries.
- Brad will validate FIA as well as SALVIAS.
2014.02.27
Trait validations completed.
decisions
SALVIAS aggregating validations
- implement and test queries in the order they are numbered, instead of writing the blank queries first (Brad)
- write blank queries if they are the next query in the sequence
- test queries outside of the pipeline first
traits data (to be done by end of March)
- add TraitObservation unmapped columns to the trait table rather than mapping them to party (Brad)
- OK to just add new columns to the version-controlled schema, and let them get added to the live DB in the next full-database import
- Brad will load new data to bien2_staging.TraitObservation, rather than us importing it directly into VegBIEN from the CSVs, in the interest of time (Brad)
to do for Aaron
schema changes
- rename location.top_plot -> plot_location_id
- in the database, the triggers, and the FAQ
- send e-mail that top_plot renamed
- rename top_plot view -> plot
SALVIAS aggregating validations
- test each query outside of the pipeline, in order
- send e-mail that queries written
- implement each query in the pipeline
traits data (to be done by end of March)
- add TraitObservation unmapped columns to the trait table
- reload staging tables from bien2_staging.TraitObservation once Brad has updated it to the new traits data
TNRS client
- add new source that Brad is adding
2014.02.20
Decisions
- Priorities for Aaron
- 1) Get the pipeline finished and working with all the sources.
- 2) Document problems uncovered by the quantitative validation scripts.
- 3) Don’t begin the fixes until the entire pipeline is in place.
- 4) Due to time limitations, some triaging of problems revealed by the quantitative validation process will need to be done.
Problem
- It's unclear who will be available to work with Aaron to help him fix any problems that are uncovered by the validation scripts.
To Do
Aaron
aggregating validations
- get pipeline working
- fix datasource-selection bug on output side
- traits aggregating validations
- SALVIAS aggregating validations
Martha
- Look into getting the iPlant assistance for Brad on TNRS.
2014.02.13
Decisions
- Work to make validation results accessible to the data provider as CSV files should be done LATER, not now, and is part of Brad's UI work, not Aaron's database development work.
- The queries that result in a count would be useful to report in a logfile. The data provider could look at the numeric values and would know whether they were correct. Report the column header and the count.
- This would help (in part) address our concerns about validating against the staging tables instead of the original data source.
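A minimal sketch of what those logfile lines might look like; the headers and counts below are purely illustrative, not actual BIEN values:

```python
# Each count query contributes one "column header: count" line that
# the data provider can eyeball. Values are made up for illustration.
counts = {"specimen_count": 120000, "distinct_taxa": 8500}
log_lines = [f"{header}: {value}" for header, value in counts.items()]
print("\n".join(log_lines))
```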
To Do
Aaron
Diagram
- Add details to the quantitative validation workflow diagram as discussed, including links to examples. (DONE)
Development
- Complete implementation of the input pipeline and table join (diff)
- Send results from Traits to the group.
- Then move on to SALVIAS and other data sources (if time permits.)
2014.02.06
Decisions
- Aaron will use SQL to compare the records instead of a diff of text files.
- If one query takes more than 10 minutes, add a record limit (5,000 or 10,000), but keep the number of records as large as possible.
- In the interest of time, Brad will keep the cultivated specimen data sources in a separate database outside of BIEN.
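The SQL-based comparison decided above can be sketched with symmetric EXCEPT queries: each side's rows minus the other's, instead of diffing text dumps. The table and column names here are illustrative stand-ins, not the BIEN schema:

```python
import sqlite3

# In-memory stand-ins for the input-side and output-side result sets
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE input_result  (taxon TEXT, n INTEGER);
    CREATE TABLE output_result (taxon TEXT, n INTEGER);
    INSERT INTO input_result  VALUES ('Inga', 10), ('Quercus', 4);
    INSERT INTO output_result VALUES ('Inga', 10), ('Quercus', 5);
""")

# Rows present on one side but not the other, in place of a text diff
missing = con.execute("SELECT * FROM input_result "
                      "EXCEPT SELECT * FROM output_result").fetchall()
extra = con.execute("SELECT * FROM output_result "
                    "EXCEPT SELECT * FROM input_result").fetchall()
print(missing)  # [('Quercus', 4)]
print(extra)    # [('Quercus', 5)]
```

An empty result from both EXCEPT queries means the two sides agree.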
To Do
Aaron
- Make a high-level diagram to document the validation pipeline. (DONE)
- Continue work on the Traits validation pipeline; run it and let us know how the validations came out.
- For traits, remove the empty rows (pre-delete them) that resulted from records that lacked a taxon name.
- Then adjust the queries.
- Reminder to contact Brad promptly with any questions or if problems arise. Don't wait for the Thursday call.
Brad
- Continue work on cultivated specimens and taxonomy.
Brad and Aaron
- Reminder to send emails to the entire group so everyone is in the loop and can help.
2014.01.30
Decisions
- (Not a decision, but on the table) Due to Mike's time constraints, we may have to consider doing the VegBank and CVS quantitative validations against the VegCore schema instead of against the original databases, since so much careful spot-checking of these sources has already been done.
To Do
Quantitative Validations
- Aaron: (1) For the quantitative validation pipeline, begin with traits:
- Edit Brad’s input queries for traits to use vegCore names.
- Complete Brad’s taxon name related queries (5,6,8).
- Send the queries to Brad on Friday.
- Aaron: Write corresponding output queries.
- Send them to Brad so he can learn and help.
- Create pipeline.
- Generate report of any failures.
- (2) After the pipeline is working for Traits, move on to SALVIAS.
- Brad: After Aaron has the pipeline working, create a simple diagram documenting the quantitative pipeline.
- Include links to scripts and the diff files.
- Brad and Aaron: Check in with each other on Monday.
- Brad: Talk with Bob and Mike separately regarding Mike's availability to write VegBank and CVS quantitative validation queries against the original db schemas.
- Brad: Complete the FIA input queries.
Cultivated specimens
- Brad: Schedule a call to discuss cultivated specimens and handle it outside of the BIEN database calls. Met Feb. 3. Notes here.
Carnegie Data
- Brian: Request a data sample.
- Let them know BIEN (or VegBank) would be ready to ingest it around April.
Planning call
- Martha: Schedule for Th or F of next week. Scheduled for Feb. 6th.
2014.01.23
Decisions
- For specimens, validate against the VegCORE staging tables.
- For plots, validate against the unmapped original database column names.
- Brad will write the queries against the unmapped original databases, and Mike needs to do so for VegBank and CVS.
To Do
Aaron
- Write the blank quantitative validation queries and fix the queries that have comments in CAPS.
- Review the taxon related queries since Brad is less confident about them.
- Let Brad and Mike know when the queries are ready since they'll need to refer to them to write the plot validation queries.
- Make any necessary easy schema changes for the queries (above).
- Inform the group if any difficult schema changes arise.
- Create the quantitative validation pipeline.
- Do all queries on the Postgres databases in house, so we don’t have to match up between MySQL and Postgres. The MySQL queries will need to be translated to Postgres.
- For plots, unmap the staging table names.
- Let Brad and Mike know when tables/dbs with original plot column names are ready.
- For specimens, write the queries against the VegCORE staging tables for each specimen source.
Brad and Mike
- Brad will write the queries against (most) unmapped original databases.
- Mike will need to do so for VegBank and CVS. Mike may not have time.
2014.01.16
Decisions and Clarifications
Decision: Take validation of BIEN2 traits off Aaron’s plate. Brad will do it.
Clarification: For now, Aaron should not work on the other issues found from spot checking other data sources. That is a lower priority than the Quantitative Validation work.
To Do
Plot (and Project) Data Providers
Aaron: For SALVIAS, complete the work as described in Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18.
BIEN2 Traits
Brad: Validate the BIEN2 trait data, taking this task off Aaron's plate. Use the VegBIEN normalized trait table and the original input data from BIEN2.
- Spot check the data.
- Write quantitative validation queries.
- Send queries to Aaron so he can put them into the validation pipeline.
Aaron: After Brad sends you his BIEN2 Trait quantitative validation queries, put them into the validation pipeline.
Quantitative Validations
Brad: Send Aaron the queries on the original SALVIAS database, which he forgot to attach previously.
Aaron: Work on Items 3.1 and 3.2 described in "Plot data providers" in Brad's "High priority tasks" email message of Dec. 17/18.
- After the queries (12,13,15) are fixed, send them back to Brad so he understands where his mistakes were.
2014.01.09
To Do
Aaron
- Item 2. "Plot data providers" from Brad's "High priority tasks" email message of Dec. 17/18.
- GBIF
- Taxon names (in Jan. for beta)
- Code a workaround for the accepted names that are missing family names.
- Then, re-run the names that are missing family names.
- BIEN2 Traits (in Jan. for beta). Reassigned to Brad 01.16. Aaron will left join "the view that left joins all the tables" to the additional trait information for Brad to validate.
- 3. Quantitative validations: Aaron, don't worry about this item yet. After the call, Martha saw in her Dec. 19 meeting notes that we had decided item 3, Quantitative validations, from the "High priority tasks" email should NOT be considered part of the beta milestone. We decided to do it AFTER beta.
Martha
- Remind Brian about the SALVIAS publishable data issue.
- From last week: Add to the BIEN development plan under serving data > API > interface,
- Set things up so that a user could search TRY simultaneously when searching BIEN.
- Then, user could request any TRY data that is not already in BIEN.
2013.12.19
To Do
Aaron (for beta version of BIEN3)
- VegBank, CVS and SALVIAS (in Dec.)
- Highest priority fixes for Aaron to complete first are items #1 and #2 specified in Brad's email subject "High priority tasks" date Dec. 17.
- GBIF
- Taxon names (in Jan. for beta)
- Code a workaround for the accepted names that are missing family names.
- Then, re-run the names that are missing family names.
- BIEN2 Traits (in Jan. for beta)
- Aaron will left join "the view that left joins all the tables" to the additional trait information for Brad to validate. Reassigned to Brad 01.16
Aaron (later, after beta)
These will be prioritized later.
- Quantitative Validations: write scripts to do quantitative validation on all data sources.
- Complete the ERD.
- Before the normalized BIEN db version 3.0 is declared ready:
- Refresh the TNRS cache.
- Reload all BIEN data.
- Geoscrubbing: Standardized country names not returned for records lacking lat/lon.
- Trait data from Cyrill
- Add Brad's PHP script into the BIEN3 pipeline to enable import of new trait data from Cyrill.
Brian (for beta)
- Address the SALVIAS publishable data sources issue.
- See Brad's email Subject: "publishable datasources", Date: Dec. 17
Brad, Bob, Mike (for beta)
- Review any remaining validation extracts Aaron sends you.
Martha
- Add to the BIEN development plan under serving data > API > interface,
- Set things up so that a user could search TRY simultaneously when searching BIEN.
- Then, user could request any TRY data that is not already in BIEN.
2013.12.17 High Priority Tasks
email from Brad Dec. 17, 2013
Hi Aaron,
As a result of our meeting this morning, we decided that we will not be ready for a beta release in December. The reason for this is that we must complete the following tasks before making the data available.
We would like you to complete these tasks in the order listed. Please try to complete tasks 1 and 2 this week, and get started on task 3 in early January when you return to work.
1. Projects completed 01.09
Correct mapping and loading scripts for VegBank and CVS to ensure that projects can be loaded correctly to the core database. No core schema changes should be required. Do whatever is necessary to load this information into the core database; either (a) reload the entire dataset, or (b) write a custom single-use script to add the missing projects and link them to the relevant plots and sources.
2. Plot data providers completed for VegBank 01.16, SALVIAS 01.23; CVS?
Make all changes necessary to mappings, loading scripts and core schema to ensure that primary data providers for plots can be loaded to the core database and correctly linked to the plots for which they are responsible. Do whatever is necessary to actually load this information into the core database: either (a) reload the entire dataset, or (b) write a custom single-use script to add the missing data providers to the core database and link them to the relevant plots, projects and sources. I have added 2 new tables to salvias_plots on nimoy which contain the missing information on SALVIAS data providers (party_code_party and party). Please use these tables when revising the SALVIAS loading scripts. Bob and Mike will answer questions you might have regarding VegBank and CVS data providers.
3. Quantitative validations
- 3.1. Complete quantitative validation scripts for SALVIAS. See my Dec. 10 message "Quantitative validations for SALVIAS". Fix queries 12, 13 and 15 so they run correctly. Please send them back to me when they are fixed.
- 3.2. I will send you the validation queries for the original SALVIAS database. After you receive them, run a diff on the results of both sets of queries and report the results to me and the BIEN DB group.
- 3.3. Write the quantitative validations for NYBG against the core BIEN database. I have already sent you the equivalent queries in MySQL against the raw data on nimoy (see my message of Dec. 12, "Quantitative validations for NYBG"). Run a diff on the results of both sets of queries and report the results to me and the BIEN DB group.
- 3.4. Repeat the quantitative validation process for each plot data source. These should produce equivalent result sets to my SALVIAS example queries for vegbien. You will need a custom set of queries for each original data source. You will not need to write new vegbien queries; just use the SALVIAS vegbien queries and change the parameters. Perform a diff on the bien and original db results for each data source and report back to me and the BIEN DB group.
- 3.5. Repeat the quantitative validation process for each specimen data source. These should produce equivalent result sets to my NYBG example queries. You will need a custom set of queries for each original specimen data source. You will not need to write new vegbien queries; just use the vegbien queries you wrote previously for NYBG and change the parameters. Perform a diff on the bien and original db result sets for each data source and report back to me and the BIEN DB group.
Please track your progress with quantitative validations on the BIEN development wiki.
Let me know if you have any questions.
Brad