Reading Source Data

Read In CSV Source Data

Our first step will be to read our source data into R. For this example, we’ll be using data from six bearded seal deployments off the northwest coast of Alaska. The data were downloaded as comma-separated files from the Wildlife Computers Data Portal. All data have been merged into a single csv file and grouped by a ‘DeployID’ unique identifier. The raw-data/akbs_locs.csv file contains all of the Argos location estimates determined for each deployment as well as fastloc GPS location estimates when available. At a minimum, this file includes identifying columns such as DeployID, PTT, Date, Quality, Type, Latitude, and Longitude. If the Argos Kalman filtering approach has been enabled (which it should be for any modern tag deployment), then additional columns describe the error ellipses (e.g. Error Semi-major axis, Error Semi-minor axis, Error Ellipse orientation). Note, if the tag transmitted GPS/FastLoc data, this file will list those records as Type = 'FastGPS'.

In many situations, the original source data may include multiple _*.csv_ or other delimited files. Instead of reading each file in one by one, the purrr::map_df() function can be used to cycle through the _*.csv_ files in a directory, pass each one to readr::read_csv(), and return a single merged tibble/data.frame. This was the original case for these Alaska bearded seal data, and the following code was used to create the combined _*.csv_ file from several _*-Locations.csv_ files.

library(dplyr)
library(readr)
library(purrr)
library(here)

my_cols <- cols(
  DeployID = col_character(),
  Ptt = col_integer(),
  Instr = col_character(),
  Date = col_datetime("%H:%M:%S %d-%b-%Y"),
  Type = col_character(),
  Quality = col_character(),
  Latitude = col_double(),
  Longitude = col_double(),
  `Error radius` = col_integer(),
  `Error Semi-major axis` = col_integer(),
  `Error Semi-minor axis` = col_integer(),
  `Error Ellipse orientation` = col_integer(),
  Offset = col_character(),
  `Offset orientation` = col_character(),
  `GPE MSD` = col_character(),
  `GPE U` = col_character(),
  Count = col_character(),
  Comment = col_character()
)

# find all *-Locations.csv files, read them with the column spec above,
# and write the combined result to a single csv file
tbl_locs <- list.files(file.path('~/Downloads/akbs_data'),
                       recursive = TRUE, 
                       full.names = TRUE, 
                       pattern = "-Locations\\.csv$") %>% 
  purrr::map_df( ~ readr::read_csv(.x, col_types = my_cols)) %>% 
  readr::write_csv(file = here::here('raw-data','akbs_locs.csv'))

The readr package includes the read_csv() function, which we will rely on to read the csv data into R. This function is very similar to read.csv() but provides some additional functionality and consistency. One important step is to specify the column types for each column in the data set; this saves an additional post-read step where we would have to mutate each column type. Note, we also rely on the here package for reliable and consistent file paths. Lastly, the janitor::clean_names() function is used to provide a consistent transformation of column names (e.g. lower snake case with no spaces).
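
A minimal sketch of that read step might look like the following; it assumes the combined file created above lives at raw-data/akbs_locs.csv, reuses the my_cols specification, and produces the snake case column names (e.g. deploy_id, error_radius) used throughout the rest of this section.

library(janitor)

akbs_locs <- readr::read_csv(
  here::here('raw-data', 'akbs_locs.csv'),
  col_types = my_cols
) %>% 
  janitor::clean_names()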

Efficient Data Storage with Arrow

As previously mentioned, thoughtful data management is an important component of any movement modeling analysis. For some, a central relational database (e.g., PostgreSQL) will be a good solution. However, in many cases, an organized collection of _*.csv_ files or an embedded, in-process database (e.g. SQLite, DuckDB) will be a good option. For this example, we’re going to rely on the arrow package for R (Apache Arrow) and a collection of parquet files as our efficient data source.

library(arrow)
library(fs)

# start with a clean output directory for the parquet dataset
dir_out <- here::here("data", "akbs_locs_parquet")
if (fs::dir_exists(dir_out)) {
  fs::dir_delete(dir_out)
}
fs::dir_create(dir_out)

# write the data as a parquet dataset partitioned by deploy_id
arrow::write_dataset(akbs_locs, dir_out, partitioning = "deploy_id")
rm(akbs_locs) # remove from memory

If we take a quick look at the directory/file structure within our output directory, data/akbs_locs_parquet, we can see that arrow created a separate subdirectory for each deploy_id, and there’s a separate _*.parquet_ file within each of those subdirectories.

dir_ls(dir_out) %>% path_rel(here::here("data"))
akbs_locs_parquet/deploy_id=EB2009_3000_06A1346
akbs_locs_parquet/deploy_id=EB2009_3001_06A1332
akbs_locs_parquet/deploy_id=EB2009_3002_06A1357
akbs_locs_parquet/deploy_id=EB2011_3000_10A0219
akbs_locs_parquet/deploy_id=EB2011_3001_10A0552
akbs_locs_parquet/deploy_id=EB2011_3002_10A0200
dir_ls(dir_out, recurse = TRUE, glob = "*.parquet") %>% 
  path_rel(here::here("data"))
akbs_locs_parquet/deploy_id=EB2009_3000_06A1346/part-0.parquet
akbs_locs_parquet/deploy_id=EB2009_3001_06A1332/part-0.parquet
akbs_locs_parquet/deploy_id=EB2009_3002_06A1357/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3000_10A0219/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3001_10A0552/part-0.parquet
akbs_locs_parquet/deploy_id=EB2011_3002_10A0200/part-0.parquet

Querying Our Data

Now that we have our data organized with arrow and parquet files, we can explore how the data can be queried and used in our analysis. There are two approaches (and package frameworks) we can rely on for querying:

  1. The dplyr package’s support for arrow/parquet
  2. Direct use of SQL syntax with duckdb’s arrow/parquet integration

A dplyr example

Two common tasks we might want to perform are to calculate the number of location records by deployment and to filter our data set based on a particular date range.

In this first example, we create an open connection to our data set by passing our dir_out path to the arrow::open_dataset() function. Then, basic dplyr functions can return a table of location counts summarized by deploy_id. Note, unlike typical dplyr usage, the collect() function is required to actually return the data set records.

# open the dataset
akbs_ds <- arrow::open_dataset(dir_out) 

# query the dataset to calculate number of locations by deploy_id
akbs_ds %>% 
  count(deploy_id, name="n_locs") %>% 
  arrange(deploy_id) %>% 
  collect()
# A tibble: 6 × 2
  deploy_id           n_locs
  <chr>                <int>
1 EB2009_3000_06A1346   6491
2 EB2009_3001_06A1332   5660
3 EB2009_3002_06A1357   4913
4 EB2011_3000_10A0219   9698
5 EB2011_3001_10A0552   7559
6 EB2011_3002_10A0200  10197

In this situation, we just want to look at locations from the months of August, September, and October. For this, the month() function from the lubridate package provides a way for us to filter by month integer.

# query the data set to only include the months of August, 
# September, and October
akbs_ds %>% 
  filter(month(date) %in% c(8,9,10)) %>% 
  relocate(deploy_id) %>% 
  collect()
# A tibble: 17,597 × 18
   deploy_id      ptt instr date                type  quality latitude longitude
 * <chr>        <dbl> <chr> <dttm>              <chr> <chr>      <dbl>     <dbl>
 1 EB2009_3002… 74630 Mk10  2009-08-01 00:04:10 Fast… 8           69.2     -164.
 2 EB2009_3002… 74630 Mk10  2009-08-01 02:30:30 Argos B           69.2     -164.
 3 EB2009_3002… 74630 Mk10  2009-08-01 02:59:03 Argos B           69.2     -164.
 4 EB2009_3002… 74630 Mk10  2009-08-01 03:26:08 Argos B           69.2     -164.
 5 EB2009_3002… 74630 Mk10  2009-08-01 03:37:23 Argos B           69.2     -164.
 6 EB2009_3002… 74630 Mk10  2009-08-01 04:37:09 Argos B           69.2     -164.
 7 EB2009_3002… 74630 Mk10  2009-08-01 05:06:50 Argos B           69.2     -164.
 8 EB2009_3002… 74630 Mk10  2009-08-01 05:49:57 Argos B           69.2     -164.
 9 EB2009_3002… 74630 Mk10  2009-08-01 06:17:06 Argos B           69.2     -164.
10 EB2009_3002… 74630 Mk10  2009-08-01 06:23:29 Fast… 10          69.2     -164.
# … with 17,587 more rows, and 10 more variables: error_radius <dbl>,
#   error_semi_major_axis <dbl>, error_semi_minor_axis <dbl>,
#   error_ellipse_orientation <dbl>, offset <lgl>, offset_orientation <lgl>,
#   gpe_msd <lgl>, gpe_u <lgl>, count <lgl>, comment <chr>

A duckdb example

DuckDB is a powerful in-process database management system that is easily installed as an R package and provides built-in support for our arrow/parquet data. With the duckdb package as well as DBI, we can pass standard SQL code to query our data set. This is especially useful if you are comfortable with SQL and want to develop fairly complex queries for your data.

library(duckdb)
library(DBI)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con, 
           "SELECT deploy_id, COUNT(*) AS n_locs 
           FROM akbs_locs 
           GROUP BY deploy_id")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

as_tibble(res)
# A tibble: 6 × 2
  deploy_id           n_locs
  <chr>                <dbl>
1 EB2011_3000_10A0219   9698
2 EB2009_3000_06A1346   6491
3 EB2009_3001_06A1332   5660
4 EB2009_3002_06A1357   4913
5 EB2011_3001_10A0552   7559
6 EB2011_3002_10A0200  10197

An example of a more complex SQL query that might be of interest is one that identifies records with duplicate date-time values within each deployment. We can do this with DuckDB and SQL.

con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con, 
           "SELECT a.deploy_id, a.date, a.quality,
                   a.latitude, a.longitude, a.error_radius
            FROM akbs_locs a
            JOIN (SELECT deploy_id, date, COUNT(*)
                  FROM akbs_locs 
                  GROUP BY deploy_id, date
                  HAVING COUNT(*) > 1) b
              ON a.deploy_id = b.deploy_id AND
                 a.date = b.date
            ORDER BY a.deploy_id, a.date")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

as_tibble(res)
# A tibble: 4,820 × 6
   deploy_id         date                quality latitude longitude error_radius
   <chr>             <dttm>              <chr>      <dbl>     <dbl>        <dbl>
 1 EB2009_3000_06A1… 2009-06-25 02:21:37 A           66.5     -163.          210
 2 EB2009_3000_06A1… 2009-06-25 02:21:37 A           66.5     -163.          461
 3 EB2009_3000_06A1… 2009-06-28 01:10:04 B           66.7     -163.         2154
 4 EB2009_3000_06A1… 2009-06-28 01:10:04 B           66.8     -163.          661
 5 EB2009_3000_06A1… 2009-06-30 01:24:03 B           67.6     -164.         5687
 6 EB2009_3000_06A1… 2009-06-30 01:24:03 B           67.6     -164.          398
 7 EB2009_3000_06A1… 2009-07-02 09:34:28 B           69.1     -166.         1736
 8 EB2009_3000_06A1… 2009-07-02 09:34:28 B           69.1     -166.         2475
 9 EB2009_3000_06A1… 2009-07-02 15:54:47 B           69.2     -166.         1246
10 EB2009_3000_06A1… 2009-07-02 15:54:47 B           69.2     -166.          522
# … with 4,810 more rows

If we look at the above table, we can see the duplicate date values, but also notice that the coordinate values differ. Instead of just removing one of the duplicates at random, we can use a window function in SQL to retain the duplicate with the lower error_radius.

con <- dbConnect(duckdb::duckdb())

# register the data set as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "akbs_locs", akbs_ds)

# query
res <- dbGetQuery(con,
           "SELECT a.*
            FROM
              (SELECT *,
                      ROW_NUMBER() OVER (
                        PARTITION BY deploy_id, date
                        ORDER BY error_radius ASC) AS rn
               FROM akbs_locs) a
            WHERE a.rn = 1
            ORDER BY deploy_id, date, error_radius")

# clean up
duckdb_unregister(con, "akbs_locs")
dbDisconnect(con, shutdown = TRUE)

as_tibble(res)
# A tibble: 42,047 × 19
     ptt instr date                type  quality latitude longitude error_radius
   <dbl> <chr> <dttm>              <chr> <chr>      <dbl>     <dbl>        <dbl>
 1 74627 Mk10  2009-06-23 00:00:00 User  <NA>        66.4     -162.           NA
 2 74627 Mk10  2009-06-23 02:41:12 Argos A           66.3     -163.          476
 3 74627 Mk10  2009-06-23 03:07:11 Argos B           66.4     -162.          658
 4 74627 Mk10  2009-06-23 04:08:10 Argos B           66.3     -162.         5178
 5 74627 Mk10  2009-06-23 04:15:36 Argos B           66.3     -162.         3877
 6 74627 Mk10  2009-06-23 04:27:12 Argos A           66.3     -162.          157
 7 74627 Mk10  2009-06-23 04:44:54 Argos B           66.3     -162.          432
 8 74627 Mk10  2009-06-23 05:56:58 Argos A           66.3     -163.          174
 9 74627 Mk10  2009-06-23 06:24:33 Argos A           66.4     -162.          262
10 74627 Mk10  2009-06-23 07:35:42 Argos B           66.4     -162.          972
# … with 42,037 more rows, and 11 more variables: error_semi_major_axis <dbl>,
#   error_semi_minor_axis <dbl>, error_ellipse_orientation <dbl>, offset <lgl>,
#   offset_orientation <lgl>, gpe_msd <lgl>, gpe_u <lgl>, count <lgl>,
#   comment <chr>, deploy_id <chr>, rn <dbl>

Create Our Location Dataset

We’re going to rely on the typical dplyr approach to query our source data and distill it down to only the essential columns needed for our analysis. Also, since we have a combination of Argos and FastGPS data, we need to create a separate column for the number of satellites used in the fastloc GPS calculation. For FastGPS records, the Quality column reports the number of satellites rather than an Argos location class, so we move those values into a new num_sats column and set quality to NA for those records.

akbs_ds <- arrow::open_dataset(dir_out)

akbs_locs <- akbs_ds %>% 
  dplyr::select(deploy_id,
                date,
                type,
                quality,
                latitude,
                longitude,
                error_radius,
                error_semi_major_axis,
                error_semi_minor_axis,
                error_ellipse_orientation) %>% 
  mutate(
    num_sats = case_when(
      type == "FastGPS" ~ quality,
      TRUE ~ NA_character_
      ),
    quality = case_when(
      type == "FastGPS" ~ NA_character_,
      TRUE ~ quality
    ),
    num_sats = as.integer(num_sats)
    ) %>% 
  collect()
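
As a quick, optional check (not part of the original workflow), we can confirm that num_sats is only populated for FastGPS records.

# count records and missing num_sats values by location type
akbs_locs %>% 
  group_by(type) %>% 
  summarise(n_locs = n(), n_missing_sats = sum(is.na(num_sats)))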

Remove Duplicate Date-Time Records

It’s not unusual for a small number of records within a deployment to have duplicate times. We will want to identify and remove these records early in the process.

# within each deployment, sort by date and error_radius so the duplicate
# with the smallest error_radius comes first, flag any subsequent records
# that share the same date, and keep only the first (rank == 1) record
akbs_locs <- akbs_locs %>% 
  group_by(deploy_id) %>% 
  arrange(date, error_radius) %>% 
  mutate(
    rank = 1L,
    rank = case_when(duplicated(date, fromLast = FALSE) ~
                       lag(rank) + 1L, TRUE ~ rank)) %>% 
  dplyr::filter(rank == 1)
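
As a quick confirmation (again, an optional check), we can verify that no duplicate timestamps remain within any deployment; the query below should return zero rows.

# any remaining duplicate date values within a deployment?
akbs_locs %>% 
  group_by(deploy_id, date) %>% 
  filter(n() > 1) %>% 
  ungroup()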

Convert to Spatial Feature

Since our data are, at their core, spatial observations, we will benefit from converting our akbs_locs tibble/data.frame into a spatial point feature with the sf package. To accomplish this, we need to identify our coordinate columns and know the coordinate reference system (CRS). For all Argos and GPS data, the coordinate reference system is geographic, with X and Y represented as longitude (-180, 180) and latitude (-90, 90). You should ensure the data are formatted as decimal degrees and not some combination of degrees, minutes, and seconds. To specify the CRS, we’ll rely on the EPSG code 4326, which tells sf that our data are geographic.

library(sf)

akbs_locs <- sf::st_as_sf(akbs_locs, coords = c("longitude","latitude"),
                          crs = "epsg:4326")

akbs_locs %>% 
  dplyr::select(deploy_id, date, geometry)
Simple feature collection with 42047 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -175.137 ymin: -15.7054 xmax: 170.878 ymax: 73.478
Geodetic CRS:  WGS 84
# A tibble: 42,047 × 3
# Groups:   deploy_id [6]
   deploy_id           date                           geometry
   <chr>               <dttm>                      <POINT [°]>
 1 EB2009_3000_06A1346 2009-06-23 00:00:00     (-162.42 66.38)
 2 EB2009_3000_06A1346 2009-06-23 02:41:12 (-162.5562 66.3168)
 3 EB2009_3000_06A1346 2009-06-23 03:07:11 (-162.4431 66.3656)
 4 EB2009_3000_06A1346 2009-06-23 04:08:10 (-162.3752 66.3216)
 5 EB2009_3000_06A1346 2009-06-23 04:15:36 (-162.3757 66.3222)
 6 EB2009_3000_06A1346 2009-06-23 04:27:12  (-162.3922 66.345)
 7 EB2009_3000_06A1346 2009-06-23 04:44:54 (-162.4411 66.3316)
 8 EB2009_3000_06A1346 2009-06-23 05:56:58 (-162.5032 66.3487)
 9 EB2009_3000_06A1346 2009-06-23 06:24:33 (-162.4993 66.3518)
10 EB2009_3000_06A1346 2009-06-23 07:35:42 (-162.4958 66.3583)
# … with 42,037 more rows

We can see that instead of separate columns for longitude and latitude, we now have a single geometry column that describes the point geometry of each location. Also note that the metadata indicates the Geodetic CRS is WGS 84, which tells us our CRS specification is correctly set to geographic with longitude/latitude coordinate values.
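
If we want to confirm the CRS assignment programmatically, the sf::st_crs() function returns the full CRS details; checking the EPSG code (a quick, optional check) should return 4326 for these geographic data.

# confirm the EPSG code assigned to the point geometry
sf::st_crs(akbs_locs)$epsg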