
Reagan Moore

386 days ago
  • Jim Myers: SEAD data services
Christine B
  • Why do we need data patriotism?!
  • Active (source project) and social (user community) curation
  • Q&A
  • Jianwu Wang asked what will happen when the project is over.  SEAD has agreements with Indiana, ICPSR and others that extend 5 or more years after the project.  Helping to start the National Data Service could be how it extends beyond the project lifetime.
  • Reagan Moore asked: 1. Does SEAD have services to help evaluate Data Management Plans?  One could apply it that way with some work.  SEAD has the query interfaces, but they haven't been made into a service.  Could iRODS pass a rule to SEAD's API?  Use case: identify which files are missing metadata.
  • 2. Are there access controls?  Yes.  You can define them in SEAD.
  • 3. Auditing: can I track what others on the team have been doing?  Yes, at some level.  SEAD can track everything from who viewed a page to who added tags or metadata, and on what date.  Major actions are recorded.  This can provide the information needed for enforcing ISO 16363 trustworthiness criteria.
  • Christine Kirkpatrick asked about using SEAD in front of repositories stored outside of SEAD, especially for large files.  With a Globus endpoint, one can use SEAD to add metadata.
  •  
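The "which files are missing metadata" use case from the Q&A above can be sketched in a few lines. This is only an illustration in plain Python: it does not call SEAD's or iRODS's actual APIs, and the record layout (`path`, `metadata`) and the list of required fields are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical required fields; a real Data Management Plan would define its own.
REQUIRED_FIELDS = ("title", "creator", "date")


def files_missing_metadata(
    records: Iterable[Dict[str, Any]],
    required: Iterable[str] = REQUIRED_FIELDS,
) -> Dict[str, List[str]]:
    """Map each file path to the required fields it lacks.

    A field counts as missing if it is absent or has an empty value.
    """
    required = tuple(required)
    missing: Dict[str, List[str]] = {}
    for record in records:
        # Treat a record with no "metadata" key as having no metadata at all.
        metadata = record.get("metadata") or {}
        absent = [name for name in required if not metadata.get(name)]
        if absent:
            missing[str(record["path"])] = absent
    return missing
```

Given a listing exported from a repository, the returned dictionary is the report a site could review when checking itself against its DMP.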
  • Rebecca Koskela: DataONE (Data Observation Network for Earth)
  • Major CI components: Coordinating nodes, Member nodes and Investigator Toolkit.
  • New, different ways to become a member node
  • Providing a registry service for member nodes
  • Emphasis on provenance and semantics-enabled discovery
  • Data services registry - currently only available for same service member node data
  • Q&A
  • Reagan Moore asked if they've worked with CyVerse.  Yes, they are talking to them currently. Mike Conway said they're starting with HydroShare running dockerized hydrology workflows via the CyVerse middleware stack.  This would be appropriate for publication on DataONE via member node software.
  • How can sites check whether or not they're meeting their DMP requirements?  Rebecca said it could be added to the registry if someone added it as a service.
  • All NSF DMP reqs (38) can be mapped to ISOTrust requirements.
  • What about the sustainability of the software you've developed?  They've put it on GitHub.  They also have a Slack channel for developers to communicate.  Reagan Moore noted this is one of the hopes for the BDHubs, that they would help with this aspect of sustainability.
  • Christine Kirkpatrick asked if they are looking for new coordinating nodes.  Yes, Australia is interested in adding one.  All coordinating nodes are replicas.  DataONE doesn't own any data; it all has to be open-access, public data.
  • Jim Myers asked if they've looked at their statistics.  Yes, the data are very different, varying from one file to thousands.  What is the best metric to use?  They use uploads and downloads for activity, and data citations for reuse.
 
 
Participants [ Please add your name and email]: 
Christine Kirkpatrick, NDS (SDSC/NCSA), West Big Data Hub
Zeydy Ortiz, DataCrunch Lab
Jason Coposky, RENCI
Mike Conway, DataNet Federation Consortium
Karl Gustafson, South Big Data Hub
John Moore
One call in
 
 
 
Notes [PLEASE ADD YOUR NOTES AND COMMENTS HERE]:
Thank you all. If you want to provide a demo to the group, please contact Lea Shanley lshanley@renci.org. 
  • Meeting 5:  October 14, 2016 3:00-4:30 PM ET
 
Ilya Baldin
 
Jim Myers
 
471 days ago
 
South Big Data Hub Community Engagement Working Group
 
Purpose: 
Several forums have identified use cases as an essential way to educate and help new entrants to HPC or data science navigate the decisions surrounding which cyberinfrastructure tools are appropriate for a given project.  
The working group seeks to collect tangible use cases for different domains, forms of cyberinfrastructure, and software tools. There will be weekly updates and monthly calls to develop 10 categories of cyberinfrastructure and several narrative-style use cases within each category (i.e. "what has worked and what should not be repeated").
 
Cyberinfrastructure Categories {Initial Draft please feel free to edit and fill in}
 
 Bridge Health data
Lea S  Coastal Hazards
 Smart Cities
 Smart Communities
 Smart Grids
Carol F  Transportation
 IP concerns 
 Rights on Data and Analytics
Lea S  Helping people with Data Management Plans
 
 Use Cases for:
 CyVerse 
Nitin S  Data Management
 Security of Scientific Provenance Data
 
 
 
 
 
 Best Practices
 
  • Big Data Tools
  • Tool 1: Hadoop
  • Use Case 1:  
  • Use Case 2:
  • Use Case 3:
  • Cautionary Tale: 
  • Do not Repeat:
 
Nitin S
  • Tool 2: Spark and RDMA Spark
  • Spark MLlib for topic modeling for social science data
  • Use Case 3:
  • Cautionary Tale: 
  • Do not Repeat:
 
Reagan M
  • Tool 3: iRODS
  • CyVerse:  a national data grid (not just data, but also data analysis tools and the analyses).  The Discovery Environment integrates distributed data management (iRODS) with workflow management (HTCondor) and application virtualization (Docker containers).
  • Bayer Pharmaceutical:  they manage petabytes of genomic data.  Contributors are organized in cohorts with their DNA analyzed in specific projects.  Every data product is tagged with the source of the data.  This makes it possible to redact all data products.  A person can leave the cohort (retract their usage agreement), and all of the derived data products can be modified to honor their retraction.
  • NOAA National Climatic Data Center:  They have the challenge of managing ingestion of environmental records from multiple communities.  The data are curated and deposited into an archive.  However, none of the data submitters are allowed to access the archive.  The solution has been to implement an iRODS data grid to manage the submitted data.  Once the curation is completed in the iRODS data grid, a process within the archive pulls the data into the archive.  This makes it possible to track all submissions, manage the curation, and track the formal accession into the archive.
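The source-tagging idea in the Bayer use case can be illustrated with a small sketch. This is not how Bayer or iRODS implements it; it is plain Python under the assumption that every product records the contributors it derives from, and the names `DataProduct`, `ProvenanceCatalog`, `register` and `affected_by` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class DataProduct:
    name: str
    # IDs of every contributor whose data this product derives from.
    sources: Set[str] = field(default_factory=set)


class ProvenanceCatalog:
    """Toy catalog that propagates source tags to derived products."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(
        self,
        name: str,
        contributors: Iterable[str] = (),
        derived_from: Iterable[str] = (),
    ) -> DataProduct:
        # A derived product inherits the source tags of all its parents.
        sources = set(contributors)
        for parent in derived_from:
            sources |= self._products[parent].sources
        product = DataProduct(name, sources)
        self._products[name] = product
        return product

    def affected_by(self, contributor: str) -> List[str]:
        """Names of all products that must be revisited if the contributor retracts."""
        return sorted(
            name
            for name, product in self._products.items()
            if contributor in product.sources
        )
```

Because tags are inherited at registration time, a retraction becomes a single lookup rather than a walk back through every analysis.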
Nitin S
  • Tool 4: Neo4j
Reagan M
  • QueryArrow is a metadata virtualization mechanism that enables the distribution of metadata across databases, including graph databases, relational databases, NoSQL databases, column-oriented databases, etc.  A specific application is the use of Neo4j to manage access controls for distributed data collections.  This capability is being provided in iRODS version 4.3.
Nitin S
  • Tool 5: MALLET
  • Visual data pipeline creation and execution
  • Cloud
Reagan M
  • iRODS (http://irods.org/) - A middleware (software) that provides federated access to data distributed in many storage sites, including authentication, authorization (access control), naming, arrangement, and metadata tagging.  A key component is policy-based data management.  Policy sets can be implemented that manage protected data (PII, PHI, PCI), that implement NSF data management plans, that implement ISO 16363 trustworthiness assessment criteria.
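A rough picture of what a "policy set" means in practice: each policy is a named check, and the set is evaluated against every data object. The sketch below is plain Python, not the iRODS rule language or API; the `DataObject` fields and both example policies are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class DataObject:
    path: str
    classification: str = "open"  # hypothetical labels: "open", "PII", "PHI", "PCI"
    public: bool = False
    replicas: int = 1


Policy = Callable[[DataObject], bool]


def protected_not_public(obj: DataObject) -> bool:
    """Protected data (PII, PHI, PCI) must not be publicly readable."""
    return not (obj.classification in {"PII", "PHI", "PCI"} and obj.public)


def min_two_replicas(obj: DataObject) -> bool:
    """Every object should have at least two replicas."""
    return obj.replicas >= 2


# Hypothetical policy set; names map to the checks above.
DEFAULT_POLICIES: Dict[str, Policy] = {
    "protected_not_public": protected_not_public,
    "min_two_replicas": min_two_replicas,
}


def evaluate(
    objects: Iterable[DataObject],
    policies: Dict[str, Policy] = DEFAULT_POLICIES,
) -> Dict[str, List[str]]:
    """Return, for each non-compliant object, the names of the policies it fails."""
    violations: Dict[str, List[str]] = {}
    for obj in objects:
        failed = [name for name, check in policies.items() if not check(obj)]
        if failed:
            violations[obj.path] = failed
    return violations
```

Swapping in a different dictionary of checks is what turns the same collection from "protected data" handling into a data management plan or a trustworthiness assessment.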
 
  • Data curation tools: e.g. cleansing, metadata capture
  • Tool 1: 
  • Use Case 1:
 
  • High Performance Computing Tools
Nitin S
  • Spark MLlib
Wirawan P
  • Use case:
Nitin S
  • Graph Analytics
Wirawan P
  • Use case:
  • Parallel R packages (specific packages = ?)
  • Use case:
 
rrawlings.goss@gmail.com
  • Modeling and Simulation Tools
Nitin S
  • SimGrid
 
Wirawan P
  • Science Gateways
Wirawan P
  • Galaxy?
  • CIPRES
  • SEAGrid (materials science/grid)
  • Gateway builder 1: Airavata (software), SciGaP (cloud platform)
  • Gateway/web app builder 2: OSC Open OnDemand?
 
  •  
 
rrawlings.goss@gmail.com Domain Specific Examples: 
 
  • CI projects in Health: 
  • Use Case(s): Bioinformatics
  • Use Case(s): Precision Medicine
  • Use Case(s): Health Disparities
  • CI projects in Energy and Smart Grid
Nitin S
  • CI projects in Cybersecurity
...
