library(dplyr)
library(readr)
library(purrr)
library(here)
my_cols <- cols(
  DeployID = col_character(),
  Ptt = col_integer(),
  Instr = col_character(),
  Date = col_datetime("%H:%M:%S %d-%b-%Y"),
  Type = col_character(),
  Quality = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  `Error radius` = col_integer(),
  `Error Semi-major axis` = col_integer(),
  `Error Semi-minor axis` = col_integer(),
  `Error Ellipse orientation` = col_integer(),
  Offset = col_character(),
  `Offset orientation` = col_character(),
  `GPE MSD` = col_character(),
  `GPE U` = col_character(),
  Count = col_character(),
  Comment = col_character()
)

tbl_locs <- list.files(file.path('~/Downloads/akbs_data'),
                       recursive = TRUE,
                       full.names = TRUE,
                       pattern = "*-Locations.csv") %>%
  purrr::map_df(~ readr::read_csv(.x, col_types = my_cols)) %>%
  readr::write_csv(file = here::here('raw-data', 'akbs_locs.csv'))
Reading Source Data
Read In CSV Source Data
Our first step will be to read our source data into R. For this example, we'll be using data from six bearded seal deployments off the northwest coast of Alaska. The data were downloaded as comma-separated files from the Wildlife Computers Data Portal. All data have been merged into a single csv file and grouped by a 'DeployID' unique identifier. The raw-data/akbs_locs.csv file contains all of the Argos location estimates determined for each deployment as well as Fastloc GPS location estimates when available. At a minimum, this file includes identifying columns such as DeployID, Ptt, Date, Quality, Type, Latitude, and Longitude. If the Argos Kalman filtering approach has been enabled (which it should be for any modern tag deployment), additional data will be found in columns that describe the error ellipses (e.g. Error Semi-major axis, Error Semi-minor axis, Error Ellipse orientation). Note, if the tag transmitted GPS/Fastloc data, this file will list those records as Type = 'FastGPS'.
In many situations, the original source data may include multiple _*.csv_ or other delimited files. Instead of reading each file in one by one, the purrr::map_df() function can be used to cycle through each of the _*.csv_ files in a directory, pass them to readr::read_csv(), and then return a merged tibble/data.frame. This was the original case for these Alaska bearded seal data, and the code shown at the start of this section was used to create the combined _*.csv_ file from the individual _*-Locations.csv_ files.
The readr package includes the read_csv() function, which we will rely on to read the csv data into R. This function is very similar to read.csv() but provides some additional functionality and consistency. One important step is to specify the column types for each column in the data set; this saves an additional post-read step where we would have to mutate each column to the correct type. Note, we also rely on the here package for reliable and consistent file paths. Lastly, the janitor::clean_names() function is used to provide a consistent transformation of column names (e.g. lower and snake case with no spaces).
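The chunk that reads the combined file back into R is not shown here, but the remaining code in this chapter assumes an akbs_locs object built from raw-data/akbs_locs.csv. A minimal sketch of that step might look like the following; it relies on readr's default column guessing (write_csv() stores datetimes as ISO 8601, so they parse cleanly) and on janitor::clean_names() to produce the snake_case column names (e.g. deploy_id, error_radius) used throughout the rest of the analysis.

# read the combined csv back in and standardize the column names
akbs_locs <- readr::read_csv(here::here('raw-data', 'akbs_locs.csv')) %>%
  janitor::clean_names()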
Efficient Data Storage with Arrow
As previously mentioned, thoughtful data management is an important component of any movement modeling analysis. For some, a central relational database (e.g., PostgreSQL) will be a good solution. However, in many cases, an organized collection of _*.csv_ files or an embedded, in-process database (e.g. SQLite, DuckDB) will be a good option. For this example, we're going to rely on the Apache arrow package for R and a collection of parquet files as our efficient data source.
library(arrow)
library(fs)
dir_out <- here::here("data", "akbs_locs_parquet")

fs::dir_delete(dir_out)
fs::dir_create(dir_out)

arrow::write_dataset(akbs_locs, dir_out, partitioning = "deploy_id")

rm(akbs_locs) # remove from memory
If we take a quick look at the directory/file structure within our output directory, data/akbs_locs_parquet, we can see that arrow created a separate subdirectory for each deploy_id, and there's a separate _*.parquet_ file within each of those subdirectories.
dir_ls(dir_out) %>% path_rel(here::here("data"))
akbs_locs_parquet/deploy_id=EB2009_3000_06A1346
akbs_locs_parquet/deploy_id=EB2009_3001_06A1332
akbs_locs_parquet/deploy_id=EB2009_3002_06A1357
akbs_locs_parquet/deploy_id=EB2011_3000_10A0219
akbs_locs_parquet/deploy_id=EB2011_3001_10A0552
akbs_locs_parquet/deploy_id=EB2011_3002_10A0200
dir_ls(dir_out, recurse = TRUE, glob = "*.parquet") %>%
path_rel(here::here("data"))
akbs_locs_parquet/deploy_id=EB2009_3000_06A1346/part-0.parquet
akbs_locs_parquet/deploy_id=EB2009_3001_06A1332/part-0.parquet
akbs_locs_parquet/deploy_id=EB2009_3002_06A1357/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3000_10A0219/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3001_10A0552/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3002_10A0200/part-0.parquet
Querying Our Data
Now that we have our data organized with arrow and parquet files, we can explore ways the data can be queried and used in our analysis. There are two approaches and package frameworks we can rely on for querying:

- The dplyr package's support for arrow/parquet
- Direct use of SQL syntax with duckdb's arrow/parquet integration
A dplyr example
Two common tasks we might want to perform are to calculate the number of location records by deployment and to filter our data set based on a particular date range.
In this first example, we create an open connection to our data set using the arrow::open_dataset() function and passing it our dir_out path. Then, basic dplyr functions can return a table of location counts summarized by deploy_id. Note, unlike typical dplyr usage, the collect() function is required to actually return the data set records.
# open the dataset
akbs_ds <- arrow::open_dataset(dir_out)

# query the dataset to calculate number of locations by deploy_id
akbs_ds %>%
  count(deploy_id, name = "n_locs") %>%
  arrange(deploy_id) %>%
  collect()
# A tibble: 6 × 2
deploy_id n_locs
<chr> <int>
1 EB2009_3000_06A1346 6491
2 EB2009_3001_06A1332 5660
3 EB2009_3002_06A1357 4913
4 EB2011_3000_10A0219 9698
5 EB2011_3001_10A0552 7559
6 EB2011_3002_10A0200 10197
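Until collect() is called, arrow only builds a description of the query and reads nothing into memory. As a quick illustration (a sketch reusing the same akbs_ds object), running the pipeline without collect() returns a lazy query object that prints its expected output schema rather than the records themselves.

# without collect(), arrow returns a lazy query object and does not
# pull any records into memory
akbs_ds %>%
  count(deploy_id, name = "n_locs") %>%
  arrange(deploy_id)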
In this situation, we want to look at only the locations from the months of August, September, and October. For this, the month() function from the lubridate package provides a way for us to filter by month integer.
library(lubridate)

# query the data set to only include the months of August,
# September, and October
akbs_ds %>%
  filter(month(date) %in% c(8, 9, 10)) %>%
  relocate(deploy_id) %>%
  collect()
# A tibble: 17,597 × 18
deploy_id ptt instr date type quality latitude longitude
* <chr> <dbl> <chr> <dttm> <chr> <chr> <dbl> <dbl>
1 EB2009_3002… 74630 Mk10 2009-08-01 00:04:10 Fast… 8 69.2 -164.
2 EB2009_3002… 74630 Mk10 2009-08-01 02:30:30 Argos B 69.2 -164.
3 EB2009_3002… 74630 Mk10 2009-08-01 02:59:03 Argos B 69.2 -164.
4 EB2009_3002… 74630 Mk10 2009-08-01 03:26:08 Argos B 69.2 -164.
5 EB2009_3002… 74630 Mk10 2009-08-01 03:37:23 Argos B 69.2 -164.
6 EB2009_3002… 74630 Mk10 2009-08-01 04:37:09 Argos B 69.2 -164.
7 EB2009_3002… 74630 Mk10 2009-08-01 05:06:50 Argos B 69.2 -164.
8 EB2009_3002… 74630 Mk10 2009-08-01 05:49:57 Argos B 69.2 -164.
9 EB2009_3002… 74630 Mk10 2009-08-01 06:17:06 Argos B 69.2 -164.
10 EB2009_3002… 74630 Mk10 2009-08-01 06:23:29 Fast… 10 69.2 -164.
# … with 17,587 more rows, and 10 more variables: error_radius <dbl>,
# error_semi_major_axis <dbl>, error_semi_minor_axis <dbl>,
# error_ellipse_orientation <dbl>, offset <lgl>, offset_orientation <lgl>,
# gpe_msd <lgl>, gpe_u <lgl>, count <lgl>, comment <chr>
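The same lazy filtering works for an explicit date range. The sketch below is only illustrative; the start and end dates are arbitrary values chosen here (not part of the original analysis), and they are built outside the query so arrow simply compares the timestamp column against the two literals.

# define an arbitrary date range outside the query
range_start <- as.POSIXct("2009-08-01 00:00:00", tz = "UTC")
range_end <- as.POSIXct("2009-10-31 23:59:59", tz = "UTC")

# locations within the date range (inclusive)
akbs_ds %>%
  filter(date >= range_start, date <= range_end) %>%
  relocate(deploy_id) %>%
  collect()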
A duckdb example
DuckDB is a powerful in-process database management system that is easily installed as an R package and provides built-in support for our arrow/parquet data. With the duckdb package as well as DBI, we can pass standard SQL queries to our data set. This is especially useful if you are comfortable with SQL and want to develop fairly complex queries of your data.
library(duckdb)
# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con,
                  "SELECT deploy_id, COUNT(*) AS n_locs
                   FROM akbs_locs
                   GROUP BY deploy_id")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

tibble(res)
# A tibble: 6 × 2
deploy_id n_locs
<chr> <dbl>
1 EB2011_3000_10A0219 9698
2 EB2009_3000_06A1346 6491
3 EB2009_3001_06A1332 5660
4 EB2009_3002_06A1357 4913
5 EB2011_3001_10A0552 7559
6 EB2011_3002_10A0200 10197
An example of a more complex SQL query that might be of interest is to identify records with duplicate date-time values within each deployment. We can do this with DuckDB and SQL.
con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con,
                  "SELECT a.deploy_id, a.date, a.quality,
                          a.latitude, a.longitude, a.error_radius
                   FROM akbs_locs a
                   JOIN (SELECT deploy_id, date, COUNT(*)
                         FROM akbs_locs
                         GROUP BY deploy_id, date
                         HAVING COUNT(*) > 1) b
                     ON a.deploy_id = b.deploy_id AND
                        a.date = b.date
                   ORDER BY a.deploy_id, a.date")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

tibble(res)
# A tibble: 4,820 × 6
deploy_id date quality latitude longitude error_radius
<chr> <dttm> <chr> <dbl> <dbl> <dbl>
1 EB2009_3000_06A1… 2009-06-25 02:21:37 A 66.5 -163. 210
2 EB2009_3000_06A1… 2009-06-25 02:21:37 A 66.5 -163. 461
3 EB2009_3000_06A1… 2009-06-28 01:10:04 B 66.7 -163. 2154
4 EB2009_3000_06A1… 2009-06-28 01:10:04 B 66.8 -163. 661
5 EB2009_3000_06A1… 2009-06-30 01:24:03 B 67.6 -164. 5687
6 EB2009_3000_06A1… 2009-06-30 01:24:03 B 67.6 -164. 398
7 EB2009_3000_06A1… 2009-07-02 09:34:28 B 69.1 -166. 1736
8 EB2009_3000_06A1… 2009-07-02 09:34:28 B 69.1 -166. 2475
9 EB2009_3000_06A1… 2009-07-02 15:54:47 B 69.2 -166. 1246
10 EB2009_3000_06A1… 2009-07-02 15:54:47 B 69.2 -166. 522
# … with 4,810 more rows
If we look at the above table, we can see the duplicate date values, but also notice that the coordinate values differ between duplicates. Instead of just removing one of the duplicates at random, we can use a slightly more complex SQL query to retain the duplicate record with the lower error_radius.
con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con,
                  "SELECT a.*
                   FROM (SELECT *,
                                ROW_NUMBER() OVER (
                                  PARTITION BY deploy_id, date
                                  ORDER BY error_radius ASC) AS rn
                         FROM akbs_locs) a
                   WHERE a.rn = 1
                   ORDER BY deploy_id, date, error_radius")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

tibble(res)
# A tibble: 42,047 × 19
ptt instr date type quality latitude longitude error_radius
<dbl> <chr> <dttm> <chr> <chr> <dbl> <dbl> <dbl>
1 74627 Mk10 2009-06-23 00:00:00 User <NA> 66.4 -162. NA
2 74627 Mk10 2009-06-23 02:41:12 Argos A 66.3 -163. 476
3 74627 Mk10 2009-06-23 03:07:11 Argos B 66.4 -162. 658
4 74627 Mk10 2009-06-23 04:08:10 Argos B 66.3 -162. 5178
5 74627 Mk10 2009-06-23 04:15:36 Argos B 66.3 -162. 3877
6 74627 Mk10 2009-06-23 04:27:12 Argos A 66.3 -162. 157
7 74627 Mk10 2009-06-23 04:44:54 Argos B 66.3 -162. 432
8 74627 Mk10 2009-06-23 05:56:58 Argos A 66.3 -163. 174
9 74627 Mk10 2009-06-23 06:24:33 Argos A 66.4 -162. 262
10 74627 Mk10 2009-06-23 07:35:42 Argos B 66.4 -162. 972
# … with 42,037 more rows, and 11 more variables: error_semi_major_axis <dbl>,
# error_semi_minor_axis <dbl>, error_ellipse_orientation <dbl>, offset <lgl>,
# offset_orientation <lgl>, gpe_msd <lgl>, gpe_u <lgl>, count <lgl>,
# comment <chr>, deploy_id <chr>, rn <dbl>
Create Our Location Dataset
We're going to rely on the typical dplyr approach to query our source data and distill it down to only the essential columns needed for our analysis. Also, since we have a combination of Argos and FastGPS data, we need to create a separate column for the number of satellites used in each Fastloc GPS calculation.
akbs_ds <- arrow::open_dataset(dir_out)

akbs_locs <- akbs_ds %>%
  dplyr::select(deploy_id,
                date,
                type,
                quality,
                latitude,
                longitude,
                error_radius,
                error_semi_major_axis,
                error_semi_minor_axis,
                error_ellipse_orientation) %>%
  mutate(
    num_sats = case_when(
      type == "FastGPS" ~ quality,
      TRUE ~ NA_character_
    ),
    quality = case_when(
      type == "FastGPS" ~ NA_character_,
      TRUE ~ quality
    ),
    num_sats = as.integer(num_sats)
  ) %>%
  collect()
Remove Duplicate Date-Time Records
It's not unusual for a small number of records within a deployment to have duplicate times. We will want to identify and remove these records early in the process.
akbs_locs <- akbs_locs %>%
  group_by(deploy_id) %>%
  arrange(date, error_radius) %>%
  mutate(
    rank = 1L,
    rank = case_when(duplicated(date, fromLast = FALSE) ~ lag(rank) + 1L,
                     TRUE ~ rank)) %>%
  dplyr::filter(rank == 1)
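As a quick sanity check (a sketch reusing the akbs_locs object created above, which is still grouped by deploy_id), we can confirm that no duplicate timestamps remain within any deployment; every n_dups value should be zero.

# count remaining duplicate timestamps within each deployment
akbs_locs %>%
  dplyr::summarise(n_dups = sum(duplicated(date)), .groups = "drop")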
Convert to Spatial Feature
Since our data are, at their core, spatial observations, we will benefit from converting our akbs_locs tibble/data.frame into a spatial point feature with the sf package. In order to accomplish this, we need to identify our coordinate columns and know the coordinate reference system (CRS). For all Argos and GPS data, the coordinate reference system is geographic, with X and Y represented as longitude (-180, 180) and latitude (-90, 90). You should ensure the data are formatted as decimal degrees and not some combination of degrees, minutes, and seconds. To specify the CRS, we'll rely on the designated EPSG code of 4326, which tells sf that our data are geographic.
library(sf)
akbs_locs <- sf::st_as_sf(akbs_locs, coords = c("longitude", "latitude"),
                          crs = "epsg:4326")

akbs_locs %>%
  dplyr::select(deploy_id, date, geometry)
Simple feature collection with 42047 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -175.137 ymin: -15.7054 xmax: 170.878 ymax: 73.478
Geodetic CRS: WGS 84
# A tibble: 42,047 × 3
# Groups: deploy_id [6]
deploy_id date geometry
<chr> <dttm> <POINT [°]>
1 EB2009_3000_06A1346 2009-06-23 00:00:00 (-162.42 66.38)
2 EB2009_3000_06A1346 2009-06-23 02:41:12 (-162.5562 66.3168)
3 EB2009_3000_06A1346 2009-06-23 03:07:11 (-162.4431 66.3656)
4 EB2009_3000_06A1346 2009-06-23 04:08:10 (-162.3752 66.3216)
5 EB2009_3000_06A1346 2009-06-23 04:15:36 (-162.3757 66.3222)
6 EB2009_3000_06A1346 2009-06-23 04:27:12 (-162.3922 66.345)
7 EB2009_3000_06A1346 2009-06-23 04:44:54 (-162.4411 66.3316)
8 EB2009_3000_06A1346 2009-06-23 05:56:58 (-162.5032 66.3487)
9 EB2009_3000_06A1346 2009-06-23 06:24:33 (-162.4993 66.3518)
10 EB2009_3000_06A1346 2009-06-23 07:35:42 (-162.4958 66.3583)
# … with 42,037 more rows
We can see that instead of separate columns for longitude and latitude, we now have a single geometry column that describes the point geometry of each location. Also note that the metadata indicates the Geodetic CRS is specified as WGS 84, which tells us our CRS specification is correctly set to geographic with longitude/latitude coordinate values.
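If you want to confirm the CRS assignment programmatically, or later need a projected CRS for distance-based calculations, sf provides st_crs() and st_transform(). The short sketch below is illustrative only; the Alaska Albers projection (EPSG:3338) is simply one reasonable choice for these data and is not part of the original workflow.

# confirm the coordinate reference system assigned to the geometry column
sf::st_crs(akbs_locs)

# reproject to a projected CRS (e.g. Alaska Albers, EPSG:3338) if needed
akbs_locs_aa <- sf::st_transform(akbs_locs, 3338)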