Innovation Bridge Program

Collaboration at the Workshop

WORKING TOGETHER

The first part of the workshop will be dedicated to getting a sense of each person’s way of connecting and each organization’s needs and priorities.

Once we settle in together, we’ll break into smaller working groups. Each group will be pre-selected to work on a specific goal. Most of Tuesday (Feb 27th) will be spent working in these groups, with periodic check-ins and report-outs for cross-pollination.

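The tentative groups are:

Core Schema & ID’s
Extensions
AI/ML, Training Data & Provenance
Architectures of Participation and Building Momentum

(See below for more detail on each of these groups.)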

After the working sessions, the in-person participants will gather for a full-group circle to agree on goals for the entire (~6-month) initiative. The focus will likely be on the broader ML/AI opportunities and how we want to leverage Source Cooperative and cloud native geospatial formats, while also setting clear goals for pushing towards a 1.0 of the core field boundary data schema.

On Wednesday the 28th, we’ll make a final push in the working groups to see what we can ‘ship’.

By mid-day we expect to be ‘pens down’ on shipping, and will gather the in-person group for a closing session and discussion of next steps.

Wednesday afternoon, we’ll have remote-friendly sessions that are also open to the wider St. Louis community. During the sessions we will:


Data Schema Background

While shipping version 0.1 of a data schema in a ~2-day kickoff may seem ambitious, there is a lot of work in the cloud native geospatial ecosystem that we’ll be able to leverage. TGE has been funding some key ‘pre-work’ ahead of the workshop to give us a real chance of shipping. The core idea is to adapt many of the lessons we learned in building the SpatioTemporal Asset Catalog (STAC) specification, particularly defining a small core vocabulary with a set of flexible extensions that can evolve independently, along with validation tools that check compliance against any set of defined extensions. See ‘Towards Flexible Data Schemas’ (currently a draft Google Doc blog post; we will replace it with the actual post when published) for a lot more depth, and feel free to follow the link there to ‘part 1’ on the importance of Schemas & ID’s.

The core vocabulary will likely be quite small, just a geometry, an id and probably some sort of timestamp, but we expect a set of ‘core extensions’ to emerge that will be more commonly used and mature more quickly. A main topic for the workshop will be determining exactly what goes into the core, and defining some key extensions.
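To make the ‘small core plus independent extensions’ idea concrete, here is a minimal Python sketch of how a validator might combine a core vocabulary with whichever extensions a dataset declares. Everything in it is an assumption for illustration only: the field names (id, geometry, datetime), the ‘crop’ extension and its crop:name attribute, and the validate function are hypothetical and are not taken from the actual schema repository or validators.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Check = Callable[[Any], bool]
# Each field maps to (required?, check function).
FieldSpec = Dict[str, Tuple[bool, Check]]


def _is_nonempty_str(value: Any) -> bool:
    return isinstance(value, str) and value != ""


def _is_polygon(value: Any) -> bool:
    # Only the GeoJSON geometry type is checked here, not the coordinates.
    return isinstance(value, dict) and value.get("type") in ("Polygon", "MultiPolygon")


# Hypothetical core vocabulary: a geometry, an id, and an optional timestamp.
CORE: FieldSpec = {
    "id": (True, _is_nonempty_str),
    "geometry": (True, _is_polygon),
    "datetime": (False, _is_nonempty_str),
}

# Hypothetical extension registry; each extension evolves on its own.
EXTENSIONS: Dict[str, FieldSpec] = {
    "crop": {"crop:name": (True, _is_nonempty_str)},
}


def validate(record: Dict[str, Any], extensions: Iterable[str] = ()) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors: List[str] = []
    fields: FieldSpec = dict(CORE)
    for name in extensions:
        if name not in EXTENSIONS:
            errors.append(f"unknown extension: {name}")
            continue
        fields.update(EXTENSIONS[name])
    for key, (required, check) in fields.items():
        if key not in record:
            if required:
                errors.append(f"missing required field: {key}")
        elif not check(record[key]):
            errors.append(f"invalid value for field: {key}")
    return errors
```

A record that satisfies the core alone still validates when no extension is requested, and only picks up extra requirements once an extension is declared, which is the property that lets extensions mature at their own pace.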

The main ‘pre-work’ ahead of the workshop has been Matthias Mohr establishing a ‘placeholder’ core data schema repository, along with an ‘extension template’ repository that can be easily cloned to start a new extension. So at the workshop we should be able to focus on the core data model and then easily ship a 0.1 version that reflects the current state of thinking (and which will evolve as we shift to virtual collaboration). He’s also built a GeoJSON Validator and a GeoParquet Validator that work with flexibly defined schemas, so we should be able to validate data against the core schema and extensions very soon after they are created.


Proposed Working Groups

Core Schema & ID’s: Define the required and optional attributes that are in the core schema. The main topics of discussion will likely be ID’s and timestamps. For ID’s we’ll learn about the Global Field ID system Varda has built and the asset-registry project that FAO is leveraging for their EUDR work. For timestamps we’ll look at the variety of projects in the data survey to try to get to a flexible but opinionated definition of timestamps, and figure out if they should be mandatory or optional. The group will aim to publish a 0.1 version of the core data schema repository, ideally by the end of Day 1 so that extensions can be based on it. Ideally Day 2 is spent getting a handful of datasets into the core schema to test that it works as intended, evolving it as needed.
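As one illustration of what a ‘flexible but opinionated’ timestamp definition could look like, the Python sketch below accepts either a single instant or a start/end range, and insists that every value is ISO 8601 with an explicit UTC offset. The field names datetime, start_datetime and end_datetime are borrowed from STAC’s common metadata purely as an assumption; the group may well choose different names and rules, and the required flag simply stands in for the open mandatory-or-optional question.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


def parse_utc(text: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalise it to UTC."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so map it to an explicit offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Opinionated choice for this sketch: no naive local times.
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc)


def field_time(
    record: Dict[str, Any], required: bool = False
) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) in UTC, or None when no timestamp is present."""
    if "datetime" in record:
        instant = parse_utc(record["datetime"])
        return (instant, instant)
    if "start_datetime" in record and "end_datetime" in record:
        start = parse_utc(record["start_datetime"])
        end = parse_utc(record["end_datetime"])
        if end < start:
            raise ValueError("end_datetime precedes start_datetime")
        return (start, end)
    if required:
        raise ValueError("record has no timestamp")
    return None
```

Returning a (start, end) pair in both cases means downstream code handles one shape whether a dataset records a single observation date or a whole growing season.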

Extensions: The extensions group will hopefully do an opinionated survey of potential extensions to work on, aiming for a stack-ranked list of the most important ones. In the afternoon they’ll break into subgroups to make progress on one or more, by deciding on attributes and putting them into the extension format, starting from a clone of the ‘extension template’ repository. We may have a number of small groups, or may have more people going deep to make one or two high-quality extensions. Ideally on Day 2 at least some subgroups are able to publish extensions and get data samples that follow the core schema and the extension, validating them with the tools.

AI/ML, Training Data & Provenance: This group will likely start more broadly, with a discussion of the state of the field boundary AI/ML ecosystem and how to make it more accessible: tools, documentation, a good list of training data published on GitHub, and the potential for harmonizing it. They may break into further subgroups in the afternoon, and hopefully at least some people will start on a schema / extension for the attributes used to refer to the source data used in tracing, along with other useful provenance information. Other subgroups may push forward on globally sampled harmonized training data, and/or tools (or at least a roadmap to tools) to make it easier to work with field boundary training data. It’d be awesome on Day 2 to get some clear roadmaps for work on AI/ML for field boundaries, and to also try to convert some training data into the core defined schema and one or more extensions.

Architectures of Participation and Building Momentum: This will be a primarily virtual group (details above), looking to leverage all the interested participants who could not be present. We hope to map out the landscape of organizations who should be involved in this effort, and create a strategy for getting them involved. This could include strategizing on events, outreach and fundraising to build and sustain momentum on the initiative.