A Double Machine Learning Trend Model for Citizen Science Data
Authors:
Daniel Fink,
Alison Johnston,
Matt Strimas-Mackey,
Tom Auer,
Wesley M. Hochachka,
Shawn Ligocki,
Lauren Oldham Jaromczyk,
Orin Robinson,
Chris Wood,
Steve Kelling,
Amanda D. Rodewald
Abstract:
1. Citizen and community-science (CS) datasets have great potential for estimating interannual patterns of population change given the large volumes of data collected globally every year. Yet, the flexible protocols that enable many CS projects to collect large volumes of data typically lack the structure necessary to keep consistent sampling across years. This leads to interannual confounding, as…
▽ More
1. Citizen and community-science (CS) datasets have great potential for estimating interannual patterns of population change given the large volumes of data collected globally every year. Yet, the flexible protocols that enable many CS projects to collect large volumes of data typically lack the structure necessary to keep consistent sampling across years. This leads to interannual confounding, as changes to the observation process over time are confounded with changes in species population sizes.
2. Here we describe a novel modeling approach designed to estimate species population trends while controlling for the interannual confounding common in citizen science data. The approach is based on Double Machine Learning, a statistical framework that uses machine learning methods to estimate population change and the propensity scores used to adjust for confounding discovered in the data. Additionally, we develop a simulation method to identify and adjust for residual confounding missed by the propensity scores. Using this new method, we can produce spatially detailed trend estimates from citizen science data.
3. To illustrate the approach, we estimated species trends using data from the CS project eBird. We used a simulation study to assess the ability of the method to estimate spatially varying trends in the face of real-world confounding. Results showed that the trend estimates distinguished between spatially constant and spatially varying trends at a 27km resolution. There were low error rates on the estimated direction of population change (increasing/decreasing) and high correlations on the estimated magnitude.
4. The ability to estimate spatially explicit trends while accounting for confounding in citizen science data has the potential to fill important information gaps, helping to estimate population trends for species, regions, or seasons without rigorous monitoring data.
△ Less
Submitted 10 May, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
Statistical Inference on Tree Swallow Migrations with Random Forests
Authors:
Tim Coleman,
Lucas Mentch,
Daniel Fink,
Frank La Sorte,
Giles Hooker,
Wesley Hochachka,
David Winkler
Abstract:
Bird species' migratory patterns have typically been studied through individual observations and historical records. In recent years however, the eBird citizen science project, which solicits observations from thousands of bird watchers around the world, has opened the door for a data-driven approach to understanding the large-scale geographical movements. Here, we focus on the North American Tree…
▽ More
Bird species' migratory patterns have typically been studied through individual observations and historical records. In recent years however, the eBird citizen science project, which solicits observations from thousands of bird watchers around the world, has opened the door for a data-driven approach to understanding the large-scale geographical movements. Here, we focus on the North American Tree Swallow (\textit{Tachycineta bicolor}) occurrence patterns throughout the eastern United States. Migratory departure dates for this species are widely believed by both ornithologists and casual observers to vary substantially across years, but the reasons for this are largely unknown. In this work, we present evidence that maximum daily temperature is a major factor influencing Tree Swallow occurrence. Because it is generally understood that species occurrence is a function of many complex, high-order interactions between ecological covariates, we utilize the flexible modeling approach offered by random forests. Making use of recent asymptotic results, we provide formal hypothesis tests for predictive significance various covariates and also develop and implement a permutation-based approach for formally assessing interannual variations by treating the prediction surfaces generated by random forests as functional data. Each of these tests suggest that maximum daily temperature has a significant effect on migration patterns.
△ Less
Submitted 8 November, 2019; v1 submitted 26 October, 2017;
originally announced October 2017.