Before you use this package, answer the following two questions:

  1. Do you want a randomized list of individuals, or do you want the Top N observations for a list of individuals that you already have?
  • If randomized, then use a match_* function.
  • If Top N, then use a topn_* function.
  2. Does your data have only numeric variables, or does it have both numeric and categorical variables?
  • If numeric, then use a *_numeric function.
  • If mixed, then use a *_mixed function.

That leaves four functions to choose from:

  • match_numeric(): randomized output, numeric input.
  • match_mixed(): randomized output, mixed input.
  • topn_numeric(): list of top n matches for output, numeric input.
  • topn_mixed(): list of top n matches for output, mixed input.
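
The two questions above map directly onto the four function names. As a minimal sketch, the mapping can be written as a small helper; choose_function is a hypothetical name used for illustration only and is not part of TestContR.

```r
# Hypothetical helper (not part of TestContR): turn the two answers
# into the name of the function to call.
choose_function <- function(randomized, mixed) {
  prefix <- if (randomized) "match" else "topn"    # question 1
  suffix <- if (mixed) "mixed" else "numeric"      # question 2
  paste0(prefix, "_", suffix)
}

choose_function(randomized = TRUE,  mixed = FALSE)  # "match_numeric"
choose_function(randomized = FALSE, mixed = TRUE)   # "topn_mixed"
```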

Randomized Sample, match_* functions

R ships with a crime data set, USArrests, covering all 50 US states. It contains arrest rates for murder, assault, and rape, along with the percentage of the population living in urban areas. TestContR can be used to match states that have similar crime rates.

library(dplyr)
library(TestContR)

match_numeric(): Random selection of test and control groups/individuals for numeric metrics/variables.

Numeric only dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))
state        Murder  Assault  UrbanPop  Rape
Alabama      13.2    236      58        21.2
Alaska       10.0    263      48        44.5
Arizona      8.1     294      80        31.0
Arkansas     8.8     190      50        19.5
California   9.0     276      91        40.6
Colorado     7.9     204      78        38.7
Connecticut  3.3     110      77        11.1
Delaware     5.9     238      72        15.8
Florida      15.4    335      80        31.9
Georgia      17.4    211      60        25.8
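
Before matching, it can help to confirm that a data frame follows this layout. The sketch below rebuilds the same data frame with base R only and runs a hypothetical check, check_numeric_format, which is not part of TestContR. It assumes the labels in the first column should be unique, which the source does not state explicitly.

```r
# Rebuild the numeric-only data frame with base R (labels in column 1).
df <- data.frame(state = rownames(datasets::USArrests),
                 datasets::USArrests,
                 row.names = NULL,
                 stringsAsFactors = FALSE)

# Hypothetical check (not part of TestContR):
#   - first column holds character labels, assumed here to be unique
#   - every remaining column is numeric
check_numeric_format <- function(data) {
  labels_ok  <- is.character(data[[1]]) && !anyDuplicated(data[[1]])
  numeric_ok <- all(vapply(data[-1], is.numeric, logical(1)))
  labels_ok && numeric_ok
}

check_numeric_format(df)
```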

Build Test and Control list:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df)
#> [1] "The 1th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 1th de-duping iteration complete."

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)
CONTROL         TEST            DIST_Q     GROUP
Texas           Illinois        0.8241352  1
New Mexico      Michigan        0.5782474  2
Arizona         New York        1.0725219  3
South Carolina  North Carolina  1.0476313  4
Maine           North Dakota    0.7305609  5
Oklahoma        Ohio            0.6483903  6
Kansas          Pennsylvania    0.5456840  7
Alabama         Tennessee       0.8407489  8
Washington      Utah            0.6940667  9
South Dakota    West Virginia   0.7108812  10


Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the “test_list” input parameter:

knitr::kable(TEST_GRP)
TEST
Colorado
Minnesota
Florida
South Carolina

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df, test_list = TEST_GRP)

Results for the “test_list” input parameter:

knitr::kable(TEST_CONTROL_LIST)
CONTROL      TEST            DIST_Q     GROUP
Michigan     Colorado        1.2363108  1
New Mexico   Florida         1.2965798  2
Wisconsin    Minnesota       0.4940832  3
Mississippi  South Carolina  0.7865674  4
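
When supplying your own test list, a quick pre-flight check can catch labels that do not appear in the data. The sketch below is illustrative only and does not call TestContR. It builds the test list with base R's data.frame() as a stand-in for tribble(), and missing_labels is a hypothetical name.

```r
# Numeric-only data frame, labels in the first column (base R only).
df <- data.frame(state = rownames(datasets::USArrests),
                 datasets::USArrests,
                 row.names = NULL,
                 stringsAsFactors = FALSE)

# Test list with a single column named TEST, as in the example above.
TEST_GRP <- data.frame(TEST = c("Colorado", "Minnesota", "Florida", "South Carolina"),
                       stringsAsFactors = FALSE)

# Hypothetical check: TEST labels that are absent from the first column of df.
missing_labels <- setdiff(TEST_GRP$TEST, df[[1]])

if (length(missing_labels) > 0) {
  warning("Labels not found in df: ", paste(missing_labels, collapse = ", "))
}
```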

match_mixed(): Random selection of test and control groups/individuals with mixed metrics/variables, meaning both numeric and categorical.

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))
state Murder Assault UrbanPop Rape datasets::state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
California 9.0 276 91 40.6 Pacific
Colorado 7.9 204 78 38.7 Mountain
Connecticut 3.3 110 77 11.1 New England
Delaware 5.9 238 72 15.8 South Atlantic
Florida 15.4 335 80 31.9 South Atlantic
Georgia 17.4 211 60 25.8 South Atlantic

Build Test and Control list from mixed metrics:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df)
#> [1] "The 1th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 1th de-duping iteration complete."
#> [1] "The 2th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 2th de-duping iteration complete."

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)
CONTROL         TEST            DIST_Q     GROUP
Texas           Illinois        0.2785090  1
New Mexico      Michigan        0.2580449  2
Maryland        New York        0.3071088  3
South Carolina  North Carolina  0.0998379  4
Iowa            North Dakota    0.0891413  5
Indiana         Ohio            0.0419648  6
New Jersey      Pennsylvania    0.1273365  7
Alabama         Tennessee       0.0657239  8
Idaho           Utah            0.1403257  9
Virginia        West Virginia   0.2253755  10

Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the “test_list” input parameter:

knitr::kable(TEST_GRP)
TEST
Colorado
Minnesota
Florida
South Carolina

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df, test_list = TEST_GRP)

Results for the “test_list” input parameter:

knitr::kable(TEST_CONTROL_LIST)
CONTROL         TEST            DIST_Q     GROUP
Arizona         Colorado        0.1106264  1
Maryland        Florida         0.1386266  2
Nebraska        Minnesota       0.0616531  3
North Carolina  South Carolina  0.0998379  4


Top N matches for individuals or groups, the topn_* functions

  • NOTE: You can provide more than one group or individual to the topn_* functions, but they do not remove duplicates from the control list, so the same control can appear for several test groups or individuals. WARNING: Because of this, if you give a topn_* function a full dataset of size M with the function parameter “n” ~ M (passed as topN in the examples below) and no “test_list”, you will get roughly an M x M matrix of matches, where “n” is the function parameter that determines the size of the list of matches. For the topn_mixed function this may take a very long time to complete. In other words, be selective about the size of the Top N list you create, and it is highly advised to use the “test_list” parameter when possible. More on this below.
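
To get a feel for how quickly the output grows, the arithmetic in the warning can be sketched as follows. expected_rows is a hypothetical helper, not part of TestContR, and it assumes each test label receives up to top_n matches and is never matched to itself.

```r
# Hypothetical helper: rough number of output rows from a topn_* call.
#   n_labels: number of labels in the dataset (M)
#   top_n:    number of matches requested per test label
#   n_test:   number of test labels (defaults to all labels, i.e. no test_list)
expected_rows <- function(n_labels, top_n, n_test = n_labels) {
  n_test * min(top_n, n_labels - 1)
}

expected_rows(50, 10, n_test = 1)  # 10   : one test label, Top 10
expected_rows(50, 10)              # 500  : no test_list, Top 10
expected_rows(50, 49)              # 2450 : no test_list, n close to M
```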

topn_numeric(): Select Top N Controls for a set of groups or individuals

Build/provide a list of the obs of interest in the test_list:

test_list <- tribble(~"TEST","Colorado")

Numeric only dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))
state        Murder  Assault  UrbanPop  Rape
Alabama      13.2    236      58        21.2
Alaska       10.0    263      48        44.5
Arizona      8.1     294      80        31.0
Arkansas     8.8     190      50        19.5
California   9.0     276      91        40.6
Colorado     7.9     204      78        38.7
Connecticut  3.3     110      77        11.1
Delaware     5.9     238      72        15.8
Florida      15.4    335      80        31.9
Georgia      17.4    211      60        25.8

Build the list of Top N matches: Provide the test_list dataframe to the test_list parameter in the function as below.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10, test_list = test_list)

Results of Top N selection option:

knitr::kable(head(TOPN_CONTROL_LIST,20))
CONTROL     TEST      DIST_Q    DIST_RANK
Michigan    Colorado  1.236311  1
California  Colorado  1.287618  2
Missouri    Colorado  1.312741  3
Arizona     Colorado  1.365031  4
Nevada      Colorado  1.398859  5
Oregon      Colorado  1.533198  6
New Mexico  Colorado  1.546744  7
New York    Colorado  1.736339  8
Washington  Colorado  1.789792  9
Illinois    Colorado  1.789832  10
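
The DIST_Q values above look consistent with a Euclidean distance computed on standardized columns (mean 0, standard deviation 1). The base R sketch below assumes that metric. It is not taken from the TestContR source, and nearest_numeric is a hypothetical helper used only for illustration.

```r
# Standardize the four numeric columns; row names (states) are kept.
scaled <- scale(datasets::USArrests)

# Pairwise Euclidean distances between all states on the standardized values.
dist_mat <- as.matrix(dist(scaled, method = "euclidean"))

# Hypothetical helper (not part of TestContR): the n closest labels to `label`,
# excluding the label itself, sorted from nearest to farthest.
nearest_numeric <- function(label, n = 10) {
  d <- sort(dist_mat[label, colnames(dist_mat) != label])
  head(d, n)
}

round(nearest_numeric("Colorado", n = 3), 6)
```

Under this assumption, Colorado's nearest neighbour comes out as Michigan at a distance of about 1.2363, in line with the first row of the table above.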

Top N without a Test List: Don’t be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10)
#> Warning in TestContR::topn_numeric(df, topN = 10): If no dataframe provided for the "test_list" parameter, will use all the labels in the dataset.  Otherwise, please provide a dataframe for the "test_list" parameter with 1, or N, Test group(s) or individual(s) label(s) in a column named "TEST."
#> 
#>       See documentation for topn_numeric's test_list parameter

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))
CONTROL         TEST     DIST_Q     DIST_RANK
Louisiana       Alabama  0.7722224  1
Tennessee       Alabama  0.8407489  2
South Carolina  Alabama  0.9157968  3
Georgia         Alabama  1.1314351  4
Mississippi     Alabama  1.2831907  5
Maryland        Alabama  1.2896460  6
Arkansas        Alabama  1.2898102  7
Virginia        Alabama  1.4859733  8
New Mexico      Alabama  1.5993970  9
North Carolina  Alabama  1.6043662  10
New Mexico      Alaska   2.0580889  1
Michigan        Alaska   2.1154937  2
Maryland        Alaska   2.2777590  3
Colorado        Alaska   2.3265187  4
Tennessee       Alaska   2.3362541  5
Nevada          Alaska   2.3443182  6
Missouri        Alaska   2.5360573  7
South Carolina  Alaska   2.5640542  8
Oregon          Alaska   2.6990696  9
Arizona         Alaska   2.7006429  10


topn_mixed(): Select Top N Controls for a set of groups or individuals with mixed metrics/variables, meaning both numeric and categorical.

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))
state Murder Assault UrbanPop Rape datasets::state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
Arkansas 8.8 190 50 19.5 West South Central
California 9.0 276 91 40.6 Pacific
Colorado 7.9 204 78 38.7 Mountain
Connecticut 3.3 110 77 11.1 New England
Delaware 5.9 238 72 15.8 South Atlantic
Florida 15.4 335 80 31.9 South Atlantic
Georgia 17.4 211 60 25.8 South Atlantic

Build the list of Top N matches from mixed metrics: Provide the test_list dataframe to the test_list parameter in the function as below.

set.seed(99)
TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10, test_list = test_list)

Results of Top N selection option:

knitr::kable(head(TOPN_CONTROL_LIST,20))
CONTROL     TEST      DIST_Q     DIST_RANK
Arizona     Colorado  0.1106264  1
Nevada      Colorado  0.1325795  2
New Mexico  Colorado  0.1588753  3
Utah        Colorado  0.2025942  4
Wyoming     Colorado  0.2231019  5
Montana     Colorado  0.2879513  6
Missouri    Colorado  0.3124434  7
California  Colorado  0.3164550  8
Michigan    Colorado  0.3176979  9
Idaho       Colorado  0.3293606  10
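
For mixed data, the DIST_Q values above look consistent with a Gower-style distance: range-normalized absolute differences for numeric columns, a 0/1 mismatch for the categorical column, averaged over all columns. The base R sketch below assumes that metric. It is not taken from the TestContR source, and gower_to and the column name division are hypothetical names used for illustration.

```r
# Mixed data frame: labels, four numeric columns, one categorical column.
df <- data.frame(state = rownames(datasets::USArrests),
                 datasets::USArrests,
                 division = as.character(datasets::state.division),
                 row.names = NULL,
                 stringsAsFactors = FALSE)

# Hypothetical helper (not part of TestContR): Gower-style distance from
# `label` to every other row, returning the n nearest.
gower_to <- function(data, label, n = 10) {
  i <- which(data$state == label)
  num_cols <- vapply(data, is.numeric, logical(1))
  # Numeric columns: absolute difference divided by the column's range.
  num_part <- sapply(data[num_cols], function(x) abs(x - x[i]) / diff(range(x)))
  # Categorical column: 0 when the division matches, 1 otherwise.
  cat_part <- as.numeric(data$division != data$division[i])
  # Average the per-column dissimilarities with equal weights.
  d <- rowMeans(cbind(num_part, cat_part))
  names(d) <- data$state
  head(sort(d[-i]), n)
}

round(gower_to(df, "Colorado", n = 3), 7)
```

Under this assumption, Colorado's nearest neighbour comes out as Arizona at a distance of about 0.1106, in line with the first row of the table above.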

Top N Mixed without a Test List: Don’t be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10)
#> Warning in TestContR::topn_mixed(df, topN = 10): If no dataframe provided for the "test_list" parameter, will use all the labels in the dataset.  Otherwise, please provide a dataframe for the "test_list" parameter with 1, or N, Test group(s) or individual(s) label(s) in a column named "TEST."
#> 
#>       See documentation for topn_numeric's test_list parameter

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))
CONTROL         TEST     DIST_Q     DIST_RANK
Tennessee       Alabama  0.0657239  1
Mississippi     Alabama  0.1193394  2
Kentucky        Alabama  0.1748170  3
Louisiana       Alabama  0.2676967  4
South Carolina  Alabama  0.2845265  5
Georgia         Alabama  0.2982780  6
Arkansas        Alabama  0.3204231  7
Texas           Alabama  0.3267952  8
Virginia        Alabama  0.3309542  9
Maryland        Alabama  0.3313442  10
California      Alaska   0.1868701  1
Oregon          Alaska   0.2756384  2
Washington      Alaska   0.3324305  3
Nevada          Alaska   0.3536566  4
Michigan        Alaska   0.3674951  5
New Mexico      Alaska   0.3705949  6
South Carolina  Alaska   0.3776660  7
Maryland        Alaska   0.3917168  8
Colorado        Alaska   0.3973812  9
Arkansas        Alaska   0.4004365  10


Conclusion

Depending on your experiment, it may be prudent to add categorical metrics/variables that help align your data better. In the above examples, when using only the numeric data, Alabama’s nearest match is Louisiana, but once the state’s division is taken into consideration, Alabama’s nearest match is Tennessee. You now have the tools to create a list of nearest matches for your data, whether it is numeric or mixed.