Topic 4.5 - 数据连接¶

1. 数据连接的概念¶

在数据分析的实际应用中，我们很少只使用单一的数据框，通常需要将来自多个数据源的数据合并在一起，这时候就需要数据连接。

数据连接的概念其实并不复杂，我们通过以下例子来理解一下：

假设我们有以下这两个数据表，学生表和班级表
如果我们想得知每个学生的班主任名称，只需要将左表和右表，按照连接键 班级 来连接起来就可以了：

连接的结果是一个新的数据表，包含了左表的所有列和根据右表拓展出来的 班主任 列

具体来说，数据连接的类型主要有两种：

变异连接（Mutating joins）：从一个表向另一个表添加变量，具体来说，变异连接又分为以下几种：
- 左连接 — 保留左表的所有行，并从右表添加匹配的列 - R 中的 dplyr 包中的 left_join()
- 右连接 — 保留右表的所有行，并从左表添加匹配的列 - R 中的 dplyr 包中的 right_join()
- 内连接 — 只保留两个表中连接键匹配的行，并合并它们的列 - R 中的 dplyr 包中的 inner_join()
- 全连接 — 保留两个表中的所有行，并合并它们的列，缺失值用 NA 填充 - R 中的 dplyr 包中的 full_join()
筛选连接（Filtering joins）：根据另一个表中的匹配情况筛选观测值，具体来说，筛选连接又分为以下两种：
- 半连接 — 只保留左表中连接键在右表中存在的行 - R 中的 dplyr 包中的 semi_join()
- 反连接 — 只保留左表中连接键在右表中不存在的行 - R 中的 dplyr 包中的 anti_join()

接下来，在我们详细地介绍这些连接类型之前，我们先来导入本节使用的包和数据：

library(tidyverse) # 加载 tidyverse，用于数据操作和可视化
library(scales) # 加载 scales，用于坐标轴格式化
library(ggokabeito) # 加载 ggokabeito，色盲友好的调色板
library(ggthemes) # 加载 ggthemes，提供额外的 ggplot 主题
library(patchwork) # 加载 patchwork，用于组合多个 ggplot 图形
library(stringr) # 加载 stringr，用于字符串操作

# 读入 2024 年 ASX 200 数据集，并将财年（fyear）列转换为字符类型，以适配后续表连接需求
asx_200_2024 <- read_csv("asx_200_2024.csv") |>
    mutate(fyear = as.character(fyear))

# 读入宽格式版本的 ASX 200 财务数据 - 只有 ASX 200 指数成分公司的财务数据
financials_messy <- read_csv("financials_messy.csv")

# 读入宽格式版本的 ASX 价格数据 - 全市场近 2000 家公司的价格数据
prices_wide <- read_csv("prices_wide.csv")

# 读入混乱格式版本的 ASX 价格数据 - 全市场近 2000 家公司的价格数据
prices_messy <- read_csv("prices_messy.csv")

# 读入 GICS 行业查找表 - 澳洲的行业分类
gics_industry <- read_csv("GICS_industry.csv")

# 将宽格式的价格数据转换为长格式，以便后续分析和连接
prices_tidy <- prices_wide |>
  pivot_longer(
    cols = c("2021", "2022", "2023", "2024"),
    names_to = "fyear",
    values_to = "price"
  )

2. 变异连接（Mutating joins）¶

(1) 变异连接的四种类型¶

变异连接的四种类型，分别是内连接、左连接、右连接、全连接：

内连接：只保留两个表中连接键匹配的行，所有不匹配的行都会被丢弃：

左连接：以左表为主，保留左表的所有行，在左表但是不在右表中的行会被填充为 NA，在右表但是不在左表中的行会被丢弃：

右连接：以右表为主，保留右表的所有行，在右表但是不在左表中的行会被填充为 NA，在左表但是不在右表中的行会被丢弃，其实就是左连接的对称操作：

全连接：保留两个表中所有的行，对于没有匹配的行会填充为 NA，其实就是左连接和右连接的结合：

与数据连接相关的一组概念叫做主键和外键：

主键：
- 一个表中的一个或多个列，其值唯一地标识表中的每一行
- 例如，在以下连接的例子中，学生表中的 姓名 就是学生表的主键，班级表中的 班级 就是班级表的主键
- 主键的一个要求是必须唯一的，也就是说在一个表中，主键列中的值不能重复，否则就无法正确地进行连接
- 主键的另一个要求是不能有缺失值，也就是说在一个表中，主键列中的值不能为 NA，否则就无法正确地进行连接
外键：
- 一个表中的一个或多个列，其值引用另一个表中的主键
- 例如，在以下连接的例子中，在学生表中，班级 列就是一个外键，因为它引用了班级表中的 班级 主键

我们在 R 中使用 dplyr 包中的连接函数来实现这些连接类型：

我们想将公司财务数据和行业分类数据连接起来，首先我们先看一下这两个数据集的结构：

# 从财务数据中选择公司代码、公司名称、财年、GICS 行业代码这四列，创建一个新的数据框 `firms`
firms <- asx_200_2024 |> 
  select(gvkey, conm, fyear, gind)

# 预览
firms |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | gind <dbl> |
|-------------|------------------------|-------------|------------|
| 013312      | BHP GROUP LTD          | 2024        | 151040     |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 501010     |
| 223003      | CSL LTD                | 2024        | 352010     |
| 212650      | TRANSURBAN GROUP       | 2024        | 203050     |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 301010     |
| 212427      | FORTESCUE LTD          | 2024        | 151040     |
| 101601      | WESFARMERS LTD         | 2024        | 255030     |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 351020     |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 203020     |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 551010     |

# 预览 GICS 行业查找表
gics_industry |> slice_head(n = 10)

| gind <dbl> | industry <chr>              |
|------------|-----------------------------|
| 101010     | Energy Equipment & Services |
| 101020     | Oil, Gas & Consumable Fuels |
| 151010     | Chemicals                   |
| 151020     | Construction Materials      |
| 151030     | Containers & Packaging      |
| 151040     | Metals & Mining             |
| 151050     | Paper & Forest Products     |
| 201010     | Aerospace & Defense         |
| 201020     | Building Products           |
| 201030     | Construction & Engineering  |

在进行连接之前，必须验证主键是否唯一，以及键列中是否有缺失值：

# 检查 firms 中 gvkey 是否唯一，即是否有重复的公司代码，
# 如果没有，则 count(gvkey) 会返回一个数据框有两列：gvkey 和 n，其中 n 列表示每个 gvkey 出现的次数，如果所有 gvkey 都是唯一的，那么 n 列的值都应该是 1
# 因此 filter(n > 1) 会筛选出 n 大于 1 的行，如果没有这样的行，则会返回一个空的数据框
firms |>
  count(gvkey) |>
  filter(n > 1)

| gvkey <chr> | n <int> |
|-------------|---------|

# 检查 gics_industry 中 gind 是否唯一，即是否有重复的 GICS 行业代码
# 如果没有，则 count(gind) 会返回一个数据框有两列：gind 和 n，其中 n 列表示每个 gind 出现的次数，如果所有 gind 都是唯一的，那么 n 列的值都应该是 1
# 因此 filter(n > 1) 会筛选出 n 大于 1
gics_industry |>
  count(gind) |>
  filter(n > 1)

| gind <dbl> | n <int> |
|------------|---------|

# 检查 firms 中 gvkey 是否有缺失值
# filter(is.na(gvkey)) 会筛选出 gvkey 列中值为 NA 的行，如果没有这样的行，则会返回一个空的数据框
firms |>
  filter(is.na(gvkey))

| gvkey <chr> | conm <chr> | fyear <chr> | gind <dbl> |
|-------------|------------|-------------|------------|

# 检查 gics_industry 中 gind 是否有缺失值
# filter(is.na(gind)) 会筛选出 gind 列中值为 NA 的行，如果没有这样的行，则会返回一个空的数据框
gics_industry |>
  filter(is.na(gind))

| gind <dbl> | industry <chr> |
|------------|----------------|

首先，我们将公司表和行业表进行内连接：我们只保留两个表中连接键匹配的行，并合并它们的列：

firms_ind_inner <- firms |>
  inner_join(gics_industry, by = join_by(gind))

firms_ind_inner |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | gind <dbl> | industry <chr>                                        |
|-------------|------------------------|-------------|------------|-------------------------------------------------------|
| 013312      | BHP GROUP LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 501010     | Diversified Telecommunication Services                |
| 223003      | CSL LTD                | 2024        | 352010     | Biotechnology                                         |
| 212650      | TRANSURBAN GROUP       | 2024        | 203050     | Transportation Infrastructure                         |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 301010     | Consumer Staples Distribution &amp; Retail (New Name) |
| 212427      | FORTESCUE LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 101601      | WESFARMERS LTD         | 2024        | 255030     | Broadline Retail (New Name)                           |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 351020     | Health Care Providers &amp; Services                  |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 203020     | Passenger Airlines (New name)                         |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 551010     | Electric Utilities                                    |

# 查看 firms_ind_inner 中，是否哪些公司没有对应的行业代码，或者哪些行业没有对应的公司
firms_ind_inner |>
  filter(is.na(gvkey) | is.na(industry))

| gvkey <chr> | conm <chr> | fyear <chr> | gind <dbl> | industry <chr> |
|-------------|------------|-------------|------------|----------------|

接着，我们将公司表和行业表进行左连接：我们保留公司表中的所有行，并从行业表添加匹配的列：

firms_ind_left <- firms |>
  left_join(gics_industry, by = join_by(gind))

firms_ind_left |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | gind <dbl> | industry <chr>                                        |
|-------------|------------------------|-------------|------------|-------------------------------------------------------|
| 013312      | BHP GROUP LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 501010     | Diversified Telecommunication Services                |
| 223003      | CSL LTD                | 2024        | 352010     | Biotechnology                                         |
| 212650      | TRANSURBAN GROUP       | 2024        | 203050     | Transportation Infrastructure                         |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 301010     | Consumer Staples Distribution &amp; Retail (New Name) |
| 212427      | FORTESCUE LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 101601      | WESFARMERS LTD         | 2024        | 255030     | Broadline Retail (New Name)                           |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 351020     | Health Care Providers &amp; Services                  |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 203020     | Passenger Airlines (New name)                         |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 551010     | Electric Utilities                                    |

# 查看 firms_ind_left 中，是否哪些公司没有对应的行业代码，或者哪些行业没有对应的公司
firms_ind_left |>
  filter(is.na(gvkey) | is.na(industry))

| gvkey <chr> | conm <chr>         | fyear <chr> | gind <dbl> | industry <chr> |
|-------------|--------------------|-------------|------------|----------------|
| 326703      | L1 LONG SHORT FUND | 2024        | NA         | NA             |

接着，我们将公司表和行业表进行右连接：我们保留行业表中的所有行，并从公司表添加匹配的列：

firms_ind_right  <- firms |>
  right_join(gics_industry, by = join_by(gind))

firms_ind_right |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | gind <dbl> | industry <chr>                                        |
|-------------|------------------------|-------------|------------|-------------------------------------------------------|
| 013312      | BHP GROUP LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 501010     | Diversified Telecommunication Services                |
| 223003      | CSL LTD                | 2024        | 352010     | Biotechnology                                         |
| 212650      | TRANSURBAN GROUP       | 2024        | 203050     | Transportation Infrastructure                         |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 301010     | Consumer Staples Distribution &amp; Retail (New Name) |
| 212427      | FORTESCUE LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 101601      | WESFARMERS LTD         | 2024        | 255030     | Broadline Retail (New Name)                           |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 351020     | Health Care Providers &amp; Services                  |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 203020     | Passenger Airlines (New name)                         |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 551010     | Electric Utilities                                    |

# 查看 firms_ind_right 中，是否哪些公司没有对应的行业代码，或者哪些行业没有对应的公司
firms_ind_right |>
  filter(is.na(gvkey) | is.na(industry))

| gvkey <chr>    | conm <chr> | fyear <chr> | gind <dbl> | industry <chr>                                        |
|----------------|------------|-------------|------------|-------------------------------------------------------|
| NA             | NA         | NA          | 101010     | Energy Equipment &amp; Services                       |
| NA             | NA         | NA          | 151050     | Paper &amp; Forest Products                           |
| NA             | NA         | NA          | 201040     | Electrical Equipment                                  |
| NA             | NA         | NA          | 201050     | Industrial Conglomerates                              |
| NA             | NA         | NA          | 201060     | Machinery                                             |
| NA             | NA         | NA          | 203030     | Marine Transportation (New Name)                      |
| NA             | NA         | NA          | 251020     | Automobiles                                           |
| NA             | NA         | NA          | 252020     | Leisure Products                                      |
| NA             | NA         | NA          | 252030     | Textiles, Apparel &amp; Luxury Goods                  |
| NA             | NA         | NA          | 255020     | Internet &amp; Direct Marketing Retail (Discontinued) |
| NA             | NA         | NA          | 302030     | Tobacco                                               |
| NA             | NA         | NA          | 303010     | Household Products                                    |
| NA             | NA         | NA          | 303020     | Personal Care Products (New Name)                     |
| NA             | NA         | NA          | 351030     | Health Care Technology                                |
| NA             | NA         | NA          | 352030     | Life Sciences Tools &amp; Services                    |
| NA             | NA         | NA          | 401010     | Banks                                                 |
| NA             | NA         | NA          | 401020     | Thrifts &amp; Mortgage Finance (Discontinued)         |
| NA             | NA         | NA          | 402030     | Capital Markets                                       |
| NA             | NA         | NA          | 402040     | Mortgage Real Estate Investment                       |
| NA             | NA         | NA          | 451010     | Real Estate Investment Trusts (REITs)                 |
| NA             | NA         | NA          | 452010     | Communications Equipment                              |
| NA             | NA         | NA          | 452020     | Technology Hardware, Storage &amp; Peripherals        |
| NA             | NA         | NA          | 453010     | Semiconductors &amp; Semiconductor Equipment          |
| NA             | NA         | NA          | 501020     | Wireless Telecommunication Services                   |
| NA             | NA         | NA          | 551040     | Water Utilities                                       |
| NA             | NA         | NA          | 601010     | Diversified REITs (New Name)                          |
| NA             | NA         | NA          | 601025     | Industrial REITs (New)                                |
| NA             | NA         | NA          | 601030     | Hotel &amp; Resort REITs (New)                        |
| NA             | NA         | NA          | 601040     | Office REITs (New)                                    |
| NA             | NA         | NA          | 601050     | Health Care REITs (New)                               |
| NA             | NA         | NA          | 601060     | Residential REITs (New)                               |
| NA             | NA         | NA          | 601070     | Retail REITs (New)                                    |
| NA             | NA         | NA          | 601080     | Specialized REITs (New)                               |

最后，我们将公司表和行业表进行全连接：我们保留两个表中的所有行，并合并它们的列，缺失值用 NA 填充：

firms_ind_full  <- firms |>
  full_join(gics_industry, by = join_by(gind))

firms_ind_full |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | gind <dbl> | industry <chr>                                        |
|-------------|------------------------|-------------|------------|-------------------------------------------------------|
| 013312      | BHP GROUP LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 501010     | Diversified Telecommunication Services                |
| 223003      | CSL LTD                | 2024        | 352010     | Biotechnology                                         |
| 212650      | TRANSURBAN GROUP       | 2024        | 203050     | Transportation Infrastructure                         |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 301010     | Consumer Staples Distribution &amp; Retail (New Name) |
| 212427      | FORTESCUE LTD          | 2024        | 151040     | Metals &amp; Mining                                   |
| 101601      | WESFARMERS LTD         | 2024        | 255030     | Broadline Retail (New Name)                           |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 351020     | Health Care Providers &amp; Services                  |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 203020     | Passenger Airlines (New name)                         |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 551010     | Electric Utilities                                    |

# 查看 firms_ind_full 中，是否哪些公司没有对应的行业代码，或者哪些行业没有对应的公司
firms_ind_full |>
  filter(is.na(gvkey) | is.na(industry))

| gvkey <chr> | conm <chr>         | fyear <chr> | gind <dbl> | industry <chr>                                        |
|-------------|--------------------|-------------|------------|-------------------------------------------------------|
| 326703      | L1 LONG SHORT FUND | 2024        | NA         | NA                                                    |
| NA          | NA                 | NA          | 101010     | Energy Equipment &amp; Services                       |
| NA          | NA                 | NA          | 151050     | Paper &amp; Forest Products                           |
| NA          | NA                 | NA          | 201040     | Electrical Equipment                                  |
| NA          | NA                 | NA          | 201050     | Industrial Conglomerates                              |
| NA          | NA                 | NA          | 201060     | Machinery                                             |
| NA          | NA                 | NA          | 203030     | Marine Transportation (New Name)                      |
| NA          | NA                 | NA          | 251020     | Automobiles                                           |
| NA          | NA                 | NA          | 252020     | Leisure Products                                      |
| NA          | NA                 | NA          | 252030     | Textiles, Apparel &amp; Luxury Goods                  |
| NA          | NA                 | NA          | 255020     | Internet &amp; Direct Marketing Retail (Discontinued) |
| NA          | NA                 | NA          | 302030     | Tobacco                                               |
| NA          | NA                 | NA          | 303010     | Household Products                                    |
| NA          | NA                 | NA          | 303020     | Personal Care Products (New Name)                     |
| NA          | NA                 | NA          | 351030     | Health Care Technology                                |
| NA          | NA                 | NA          | 352030     | Life Sciences Tools &amp; Services                    |
| NA          | NA                 | NA          | 401010     | Banks                                                 |
| NA          | NA                 | NA          | 401020     | Thrifts &amp; Mortgage Finance (Discontinued)         |
| NA          | NA                 | NA          | 402030     | Capital Markets                                       |
| NA          | NA                 | NA          | 402040     | Mortgage Real Estate Investment Trusts (REITs)        |
| NA          | NA                 | NA          | 452010     | Communications Equipment                              |
| NA          | NA                 | NA          | 452020     | Technology Hardware, Storage &amp; Peripherals        |
| NA          | NA                 | NA          | 453010     | Semiconductors &amp; Semiconductor Equipment          |
| NA          | NA                 | NA          | 501020     | Wireless Telecommunication Services                   |
| NA          | NA                 | NA          | 551040     | Water Utilities                                       |
| NA          | NA                 | NA          | 601010     | Diversified REITs (New Name)                          |
| NA          | NA                 | NA          | 601025     | Industrial REITs (New)                                |
| NA          | NA                 | NA          | 601030     | Hotel &amp; Resort REITs (New)                        |
| NA          | NA                 | NA          | 601040     | Office REITs (New)                                    |
| NA          | NA                 | NA          | 601050     | Health Care REITs (New)                               |
| NA          | NA                 | NA          | 601060     | Residential REITs (New)                               |
| NA          | NA                 | NA          | 601070     | Retail REITs (New)                                    |
| NA          | NA                 | NA          | 601080     | Specialized REITs (New)                               |

(2) 变异连接的复合键情况¶

有的时候，两个表要想连接在一起，可能需要使用多个列作为连接键，这时候我们就需要使用复合键来进行连接。

我们在以下例子中演示一下复合键连接：

在以下两个表中，要想将学生表和班主任表链接在一起，我们需要使用 班级 和 年级 这两个列作为连接键
因为单纯 班级 或 年级 这两个列都不能唯一确定一个班级，所以我们需要将它们组合起来作为复合键来进行连接：

我们在以下 R 代码中演示一下如何使用 dplyr 包中的连接函数来实现复合键连接：

这里，我们想将 asx_200_2024 数据集和 prices_tidy 数据集连接在一起
我们先将两个数据集中的显式缺失值过滤掉，这样能让我们关注到隐式缺失值，并且把 asx_200_2024 数据集中只选取部分列：

# 过滤缺失值
asx_200_2024 <- asx_200_2024 |> select(gvkey, conm, fyear, sale, ebit, netprofit) |> drop_na(ebit)
prices_tidy <- prices_tidy |> drop_na(price)

我们先来看一下这两个数据集的结构：

asx_200_2024 |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | sale <dbl> | ebit <dbl> | netprofit <dbl> |
|-------------|------------------------|-------------|------------|------------|-----------------|
| 013312      | BHP GROUP LTD          | 2024        | 55658.0    | 22771.0    | 9601.0          |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 22928.0    | 3712.0     | 1788.0          |
| 223003      | CSL LTD                | 2024        | 14690.0    | 3896.0     | 2714.0          |
| 212650      | TRANSURBAN GROUP       | 2024        | 4119.0     | 1132.0     | 376.0           |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 67922.0    | 3100.0     | 117.0           |
| 212427      | FORTESCUE LTD          | 2024        | 18220.0    | 8520.0     | 5664.0          |
| 101601      | WESFARMERS LTD         | 2024        | 44189.0    | 3849.0     | 2557.0          |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 16660.2    | 938.7      | 263.3           |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 20114.0    | 2198.0     | 1251.0          |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 16138.0    | 952.0      | 1397.0          |

prices_tidy |> slice_head(n = 10)

| gvkey <chr> | conm <chr>                  | fyear <chr> | price <dbl> |
|-------------|-----------------------------|-------------|-------------|
| 005302      | LABYRINTH RESOURCES LIMITED | 2021        | 0.034       |
| 005302      | LABYRINTH RESOURCES LIMITED | 2022        | 0.020       |
| 005302      | LABYRINTH RESOURCES LIMITED | 2023        | 0.006       |
| 005302      | LABYRINTH RESOURCES LIMITED | 2024        | 0.017       |
| 013312      | BHP GROUP LTD               | 2021        | 37.610      |
| 013312      | BHP GROUP LTD               | 2022        | 38.520      |
| 013312      | BHP GROUP LTD               | 2023        | 44.250      |
| 013312      | BHP GROUP LTD               | 2024        | 45.960      |
| 014242      | ANSELL LTD                  | 2021        | 34.200      |
| 014242      | ANSELL LTD                  | 2022        | 25.190      |

很明显，在 price_tidy 数据集中，不论是 gvkey 还是 fyear 这两个列都不能唯一地标识每一行：

# 检查 prices_tidy 中 gvkey 的重复情况
prices_tidy |> 
  count(gvkey) |> 
  filter(n > 1) |>
  slice_head(n = 5)

| gvkey <chr> | n <int> |
|-------------|---------|
| 005302      | 4       |
| 010991      | 3       |
| 013312      | 4       |
| 014242      | 4       |
| 014802      | 3       |

# 检查 prices_tidy 中 fyear 的重复情况
prices_tidy |> 
  count(fyear) |> 
  filter(n > 1) |>
  slice_head(n = 5)

| fyear <chr> | n <int> |
|-------------|---------|
| 2021        | 1621    |
| 2022        | 1670    |
| 2023        | 1627    |
| 2024        | 1289    |

但是，当我们将 gvkey 和 fyear 这两个列组合起来作为复合键时，就可以唯一地标识每一行了：

# 检查 prices_tidy 中 gvkey 和 fyear 的组合是否唯一
prices_tidy |>
    count(gvkey, fyear) |> 
    filter(n > 1) |>
    slice_head(n = 5)

| gvkey <chr> | fyear <chr> | n <int> |
|-------------|-------------|---------|

我们如果想将 price_tidy 数据集和 asx_200_2024 数据集连接在一起，我们需要使用 gvkey 和 fyear 这两个列作为连接键
我们先来看内连接：

financials_prices_inner <- 
  asx_200_2024 |> 
  inner_join(
    prices_tidy |> select(gvkey, fyear, price),
    by = join_by(gvkey, fyear)
  ) |>
  select(gvkey, fyear, ebit, price, everything())

financials_prices_inner |> slice_head(n = 10)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 013312      | 2024        | 22771.0    | 45.96       | BHP GROUP LTD          | 55658.0    | 9601.0          |
| 210216      | 2024        | 3712.0     | 3.88        | TELSTRA GROUP LIMITED  | 22928.0    | 1788.0          |
| 223003      | 2024        | 3896.0     | 286.28      | CSL LTD                | 14690.0    | 2714.0          |
| 212650      | 2024        | 1132.0     | 13.13       | TRANSURBAN GROUP       | 4119.0     | 376.0           |
| 100894      | 2024        | 3100.0     | 33.30       | WOOLWORTHS GROUP LTD   | 67922.0    | 117.0           |
| 212427      | 2024        | 8520.0     | 20.68       | FORTESCUE LTD          | 18220.0    | 5664.0          |
| 101601      | 2024        | 3849.0     | 70.43       | WESFARMERS LTD         | 44189.0    | 2557.0          |
| 226744      | 2024        | 938.7      | 41.62       | RAMSAY HEALTH CARE LTD | 16660.2    | 263.3           |
| 220244      | 2024        | 2198.0     | 7.42        | QANTAS AIRWAYS LTD     | 20114.0    | 1251.0          |
| 017525      | 2024        | 952.0      | 10.01       | ORIGIN ENERGY LTD      | 16138.0    | 1397.0          |

# 检查是否有缺失值，将左表缺失或右表缺失的行展示代表出来
bind_rows(
    financials_prices_inner |> filter(is.na(ebit)) |> slice_head(n = 5),
    financials_prices_inner |> filter(is.na(price)) |> slice_head(n = 5)
)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr> | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------|------------|-----------------|

接着，我们来看左连接：

financials_prices_left <- 
  asx_200_2024 |> 
  left_join(
    prices_tidy |> select(gvkey, fyear, price),
    by = join_by(gvkey, fyear)
  ) |>
  select(gvkey, fyear, ebit, price, everything())

financials_prices_left |> slice_head(n = 10)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 013312      | 2024        | 22771.0    | 45.96       | BHP GROUP LTD          | 55658.0    | 9601.0          |
| 210216      | 2024        | 3712.0     | 3.88        | TELSTRA GROUP LIMITED  | 22928.0    | 1788.0          |
| 223003      | 2024        | 3896.0     | 286.28      | CSL LTD                | 14690.0    | 2714.0          |
| 212650      | 2024        | 1132.0     | 13.13       | TRANSURBAN GROUP       | 4119.0     | 376.0           |
| 100894      | 2024        | 3100.0     | 33.30       | WOOLWORTHS GROUP LTD   | 67922.0    | 117.0           |
| 212427      | 2024        | 8520.0     | 20.68       | FORTESCUE LTD          | 18220.0    | 5664.0          |
| 101601      | 2024        | 3849.0     | 70.43       | WESFARMERS LTD         | 44189.0    | 2557.0          |
| 226744      | 2024        | 938.7      | 41.62       | RAMSAY HEALTH CARE LTD | 16660.2    | 263.3           |
| 220244      | 2024        | 2198.0     | 7.42        | QANTAS AIRWAYS LTD     | 20114.0    | 1251.0          |
| 017525      | 2024        | 952.0      | 10.01       | ORIGIN ENERGY LTD      | 16138.0    | 1397.0          |

# 检查是否有缺失值，将左表缺失或右表缺失的行展示代表出来
bind_rows(
    financials_prices_left |> filter(is.na(ebit)) |> slice_head(n = 5),
    financials_prices_left |> filter(is.na(price)) |> slice_head(n = 5)
)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 212628      | 2024        | 1930.800   | NA          | ARISTOCRAT LEISURE LTD | 6603.600   | 1303.400        |
| 100442      | 2024        | 708.700    | NA          | ORICA LTD              | 7662.800   | 558.800         |
| 257860      | 2024        | 410.600    | NA          | INCITEC PIVOT LTD      | 5278.500   | -478.600        |
| 284805      | 2024        | 170.000    | NA          | ASCIANO LTD            | 2261.500   | -0.100          |
| 101613      | 2024        | 60.065     | NA          | NUFARM LTD             | 3345.909   | -5.598          |

接下来，我们来看右连接：

financials_prices_right <- 
  asx_200_2024 |> 
  right_join(
    prices_tidy |> select(gvkey, fyear, price),
    by = join_by(gvkey, fyear)
  ) |>
  select(gvkey, fyear, ebit, price, everything())

financials_prices_right |> slice_head(n = 10)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 013312      | 2024        | 22771.0    | 45.96       | BHP GROUP LTD          | 55658.0    | 9601.0          |
| 210216      | 2024        | 3712.0     | 3.88        | TELSTRA GROUP LIMITED  | 22928.0    | 1788.0          |
| 223003      | 2024        | 3896.0     | 286.28      | CSL LTD                | 14690.0    | 2714.0          |
| 212650      | 2024        | 1132.0     | 13.13       | TRANSURBAN GROUP       | 4119.0     | 376.0           |
| 100894      | 2024        | 3100.0     | 33.30       | WOOLWORTHS GROUP LTD   | 67922.0    | 117.0           |
| 212427      | 2024        | 8520.0     | 20.68       | FORTESCUE LTD          | 18220.0    | 5664.0          |
| 101601      | 2024        | 3849.0     | 70.43       | WESFARMERS LTD         | 44189.0    | 2557.0          |
| 226744      | 2024        | 938.7      | 41.62       | RAMSAY HEALTH CARE LTD | 16660.2    | 263.3           |
| 220244      | 2024        | 2198.0     | 7.42        | QANTAS AIRWAYS LTD     | 20114.0    | 1251.0          |
| 017525      | 2024        | 952.0      | 10.01       | ORIGIN ENERGY LTD      | 16138.0    | 1397.0          |

# 检查是否有缺失值，将左表缺失或右表缺失的行展示代表出来
bind_rows(
    financials_prices_right |> filter(is.na(ebit)) |> slice_head(n = 5),
    financials_prices_right |> filter(is.na(price)) |> slice_head(n = 5)
)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr> | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------|------------|-----------------|
| 005302      | 2021        | NA         | 0.034       | NA         | NA         | NA              |
| 005302      | 2022        | NA         | 0.020       | NA         | NA         | NA              |
| 005302      | 2023        | NA         | 0.006       | NA         | NA         | NA              |
| 005302      | 2024        | NA         | 0.017       | NA         | NA         | NA              |
| 013312      | 2021        | NA         | 37.610      | NA         | NA         | NA              |

最后，我们来看全连接：

financials_prices_full <- 
  asx_200_2024 |> 
  full_join(
    prices_tidy |> select(gvkey, fyear, price),
    by = join_by(gvkey, fyear)
  ) |>
  select(gvkey, fyear, ebit, price, everything())

financials_prices_full |> slice_head(n = 10)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 013312      | 2024        | 22771.0    | 45.96       | BHP GROUP LTD          | 55658.0    | 9601.0          |
| 210216      | 2024        | 3712.0     | 3.88        | TELSTRA GROUP LIMITED  | 22928.0    | 1788.0          |
| 223003      | 2024        | 3896.0     | 286.28      | CSL LTD                | 14690.0    | 2714.0          |
| 212650      | 2024        | 1132.0     | 13.13       | TRANSURBAN GROUP       | 4119.0     | 376.0           |
| 100894      | 2024        | 3100.0     | 33.30       | WOOLWORTHS GROUP LTD   | 67922.0    | 117.0           |
| 212427      | 2024        | 8520.0     | 20.68       | FORTESCUE LTD          | 18220.0    | 5664.0          |
| 101601      | 2024        | 3849.0     | 70.43       | WESFARMERS LTD         | 44189.0    | 2557.0          |
| 226744      | 2024        | 938.7      | 41.62       | RAMSAY HEALTH CARE LTD | 16660.2    | 263.3           |
| 220244      | 2024        | 2198.0     | 7.42        | QANTAS AIRWAYS LTD     | 20114.0    | 1251.0          |
| 017525      | 2024        | 952.0      | 10.01       | ORIGIN ENERGY LTD      | 16138.0    | 1397.0          |

# 检查是否有缺失值，将左表缺失或右表缺失的行展示代表出来
bind_rows(
    financials_prices_full |> filter(is.na(ebit)) |> slice_head(n = 5),
    financials_prices_full |> filter(is.na(price)) |> slice_head(n = 5)
)

| gvkey <chr> | fyear <chr> | ebit <dbl> | price <dbl> | conm <chr>             | sale <dbl> | netprofit <dbl> |
|-------------|-------------|------------|-------------|------------------------|------------|-----------------|
| 005302      | 2021        | NA         | 0.034       | NA                     | NA         | NA              |
| 005302      | 2022        | NA         | 0.020       | NA                     | NA         | NA              |
| 005302      | 2023        | NA         | 0.006       | NA                     | NA         | NA              |
| 005302      | 2024        | NA         | 0.017       | NA                     | NA         | NA              |
| 013312      | 2021        | NA         | 37.610      | NA                     | NA         | NA              |
| 212628      | 2024        | 1930.800   | NA          | ARISTOCRAT LEISURE LTD | 6603.600   | 1303.400        |
| 100442      | 2024        | 708.700    | NA          | ORICA LTD              | 7662.800   | 558.800         |
| 257860      | 2024        | 410.600    | NA          | INCITEC PIVOT LTD      | 5278.500   | -478.600        |
| 284805      | 2024        | 170.000    | NA          | ASCIANO LTD            | 2261.500   | -0.100          |
| 101613      | 2024        | 60.065     | NA          | NUFARM LTD             | 3345.909   | -5.598          |

3. 筛选连接（Filtering Joins）¶

筛选连接虽然也叫做连接，但是它并不是将两个表连接在一起，而是在一个表中，根据另一个表中的匹配情况来筛选观测值

说白了就是：检查一个表中的观测值是否在另一个表中存在
如果是半连接（semi-join），则是筛选左表中，在右表中也存在的行
如果是反连接（anti-join），则是筛选左表中，在右表中不存在的行

我们用以下这个例子来理解一下筛选连接：

半连接：只要保留学生表中，班级这一列，在班级表中存在的行：

反连接：只要保留学生表中，班级这一列，在班级表中不存在的行：

我们在 R 中使用 dplyr 包中的连接函数来实现这些连接类型：

我们先来看一下，在 asx_200_2024 数据集中有多少行：

nrow(asx_200_2024)

[1] 198

我们想找出 asx_200_2024 数据集中，在 prices_tidy 数据集中也存在的行，可以看出最终结果并没有 prices_tidy 中的列：

financials_prices_sj <- 
  asx_200_2024 |> 
  semi_join(
    prices_tidy,
    by = join_by(gvkey, fyear)
  )

financials_prices_sj |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | sale <dbl> | ebit <dbl> | netprofit <dbl> |
|-------------|------------------------|-------------|------------|------------|-----------------|
| 013312      | BHP GROUP LTD          | 2024        | 55658.0    | 22771.0    | 9601.0          |
| 210216      | TELSTRA GROUP LIMITED  | 2024        | 22928.0    | 3712.0     | 1788.0          |
| 223003      | CSL LTD                | 2024        | 14690.0    | 3896.0     | 2714.0          |
| 212650      | TRANSURBAN GROUP       | 2024        | 4119.0     | 1132.0     | 376.0           |
| 100894      | WOOLWORTHS GROUP LTD   | 2024        | 67922.0    | 3100.0     | 117.0           |
| 212427      | FORTESCUE LTD          | 2024        | 18220.0    | 8520.0     | 5664.0          |
| 101601      | WESFARMERS LTD         | 2024        | 44189.0    | 3849.0     | 2557.0          |
| 226744      | RAMSAY HEALTH CARE LTD | 2024        | 16660.2    | 938.7      | 263.3           |
| 220244      | QANTAS AIRWAYS LTD     | 2024        | 20114.0    | 2198.0     | 1251.0          |
| 017525      | ORIGIN ENERGY LTD      | 2024        | 16138.0    | 952.0      | 1397.0          |

nrow(financials_prices_sj)

[1] 182

我们想找出 asx_200_2024 数据集中，在 prices_tidy 数据集中不存在的行，可以看出最终结果也没有 prices_tidy 中的列：

# 找出在 asx_200_2024 中有但在 prices_tidy 中没有的行
financials_prices_aj <- 
  asx_200_2024 |> 
  anti_join(
    prices_tidy,
    by = join_by(gvkey, fyear)
  )

# 显示不匹配的观测值
financials_prices_aj |> slice_head(n = 10)

| gvkey <chr> | conm <chr>             | fyear <chr> | sale <dbl> | ebit <dbl> | netprofit <dbl> |
|-------------|------------------------|-------------|------------|------------|-----------------|
| 212628      | ARISTOCRAT LEISURE LTD | 2024        | 6603.600   | 1930.800   | 1303.400        |
| 100442      | ORICA LTD              | 2024        | 7662.800   | 708.700    | 558.800         |
| 257860      | INCITEC PIVOT LTD      | 2024        | 5278.500   | 410.600    | -478.600        |
| 284805      | ASCIANO LTD            | 2024        | 2261.500   | 170.000    | -0.100          |
| 101613      | NUFARM LTD             | 2024        | 3345.909   | 60.065     | -5.598          |
| 253358      | WESTERN AREAS LTD      | 2024        | 844.100    | -394.300   | 2.800           |
| 267188      | CUSCAL LIMITED         | 2024        | 463.600    | 153.200    | 30.100          |
| 241386      | GRAINCORP LTD          | 2024        | 6506.800   | 21.700     | 61.800          |
| 270244      | MYER HOLDINGS LIMITED  | 2024        | 2438.100   | 162.700    | 43.500          |
| 202362      | ELDERS LTD             | 2024        | 3131.290   | 113.578    | 49.848          |

nrow(financials_prices_aj)

[1] 16

4. 数据连接小结¶

在 dplyr 包中，提供了多种连接函数来实现不同类型的数据连接：

变异连接：
- 内连接（inner_join() 函数）：只保留两个表中都存在的行
- 左连接（left_join() 函数）：保留左表中的所有行，以及右表中匹配的行
- 右连接（right_join() 函数）：保留右表中的所有行，以及左表中匹配的行
- 全连接（full_join() 函数）：保留两个表中的所有行，匹配不到的部分用缺失值填充
筛选连接：
- 半连接（semi_join() 函数）：保留左表中，在右表中也存在的行
- 反连接（anti_join() 函数）：保留左表中，在右表中不存在的行

大家可以通过我们给大家举的图例，一定要理解各种连接的逻辑关系，这样才能正确地使用这些连接函数来处理数据。