TestContR

Before you use this package, answer the following two questions:

Do I want a randomized list of individuals or do I want the Top N observations for a list of individuals that I already have?

If randomized, then use a match_* function.
If Top N, then use a topn_* function.

Do you have data with only numeric variables, or does your data have both numeric and categorical?

If numeric, then use a *_numeric function.
If mixed, then use a *_mixed function.

So your choices for how you want to proceed are listed below:

match_numeric(): randomized output, numeric input.
match_mixed(): randomized output, mixed input.
topn_numeric(): list of top n matches for output, numeric input.
topn_mixed(): list of top n matches for output, mixed input.
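
The second question can also be answered programmatically. The sketch below is not part of TestContR: suggest_variant() is a hypothetical helper, and it assumes the first column holds the labels and every other column is a metric.

```r
# Hypothetical helper (not part of TestContR): suggest the function suffix
# from the column types that follow the first (label) column.
suggest_variant <- function(df) {
  metrics <- df[, -1, drop = FALSE]  # drop the label column
  all_numeric <- all(vapply(metrics, is.numeric, logical(1)))
  if (all_numeric) "numeric" else "mixed"
}

numeric_df <- data.frame(
  state = rownames(datasets::USArrests),
  datasets::USArrests,
  row.names = NULL
)
mixed_df <- cbind(numeric_df, division = datasets::state.division)

print(suggest_variant(numeric_df))  # every metric column is numeric
print(suggest_variant(mixed_df))    # the factor column makes the data mixed
```

Combine the result with the answer to the first question to pick one of the four functions listed above.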

Randomized Sample, match_* functions

R ships with a crime data set covering all 50 US states (datasets::USArrests). It contains arrests per 100,000 residents for murder, assault, and rape, along with the percentage of the population living in urban areas. The TestContR package can be used to match states that have similar crime rates.

library(dplyr)
library(TestContR)

match_numeric(): Random selection of test and control groups/individuals for numeric metrics/variables.

Numeric only dataframe:

df <- datasets::USArrests %>%
  dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

state	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7
Connecticut	3.3	110	77	11.1
Delaware	5.9	238	72	15.8
Florida	15.4	335	80	31.9
Georgia	17.4	211	60	25.8

Build Test and Control list:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df)
#> [1] "The 1th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 1th de-duping iteration complete."

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)

CONTROL	TEST	DIST_Q	GROUP
Texas	Illinois	0.8241352	1
New Mexico	Michigan	0.5782474	2
Arizona	New York	1.0725219	3
South Carolina	North Carolina	1.0476313	4
Maine	North Dakota	0.7305609	5
Oklahoma	Ohio	0.6483903	6
Kansas	Pennsylvania	0.5456840	7
Alabama	Tennessee	0.8407489	8
Washington	Utah	0.6940667	9
South Dakota	West Virginia	0.7108812	10
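
The DIST_Q values in this table are consistent with Euclidean distance computed on standardized (z-scored) columns. That is an inference from the printed output, not a documented statement about the package internals. A minimal base R sketch of that assumption, using the Texas and Illinois pair:

```r
# Assumption: DIST_Q behaves like Euclidean distance on z-scored columns.
scaled <- scale(datasets::USArrests)  # center and scale each column
dist_matrix <- as.matrix(stats::dist(scaled, method = "euclidean"))

# Compare with the DIST_Q reported for GROUP 1 in the table above.
texas_illinois <- dist_matrix["Texas", "Illinois"]
print(round(texas_illinois, 4))
```

If the assumption holds, smaller DIST_Q values mean the two states are closer across all four metrics at once.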

Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the “test_list” input parameter:

knitr::kable(TEST_GRP)

TEST
Colorado
Minnesota
Florida
South Carolina

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_numeric(df, test_list = TEST_GRP)

Results for the “test_list” input parameter:

knitr::kable(TEST_CONTROL_LIST)

CONTROL	TEST	DIST_Q	GROUP
Michigan	Colorado	1.2363108	1
New Mexico	Florida	1.2965798	2
Wisconsin	Minnesota	0.4940832	3
Mississippi	South Carolina	0.7865674	4
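
Before passing a test list like the one above, it can help to confirm that every label exists in the first column of the data. The check below is a hypothetical pre-flight step in base R, not something TestContR is known to do for you; it assumes only that the test list is a data frame with a column named "TEST".

```r
# Hypothetical pre-flight check (not part of TestContR).
df <- data.frame(
  state = rownames(datasets::USArrests),
  datasets::USArrests,
  row.names = NULL
)
test_grp <- data.frame(
  TEST = c("Colorado", "Minnesota", "Florida", "South Carolina")
)

# Labels in the test list that do not appear in the label column of df.
missing_labels <- setdiff(test_grp$TEST, df[[1]])
if (length(missing_labels) > 0) {
  stop("Labels not found in the data: ",
       paste(missing_labels, collapse = ", "))
}
```

A misspelled label stops the script here with a clear message instead of surfacing later inside the matching call.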

match_mixed(): Random selection of test and control groups/individuals with mixed metrics/variables, meaning both numeric and categorical.

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

state	Murder	Assault	UrbanPop	Rape	datasets::state.division
Alabama	13.2	236	58	21.2	East South Central
Alaska	10.0	263	48	44.5	Pacific
Arizona	8.1	294	80	31.0	Mountain
Arkansas	8.8	190	50	19.5	West South Central
California	9.0	276	91	40.6	Pacific
Colorado	7.9	204	78	38.7	Mountain
Connecticut	3.3	110	77	11.1	New England
Delaware	5.9	238	72	15.8	South Atlantic
Florida	15.4	335	80	31.9	South Atlantic
Georgia	17.4	211	60	25.8	South Atlantic

Build Test and Control list from mixed metrics:

# defaults to 10 obs for the test group with matching controls. Change the size of the test group w/ param "n".

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df)
#> [1] "The 1th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 1th de-duping iteration complete."
#> [1] "The 2th de-duping iteration started"
#> Joining, by = "CONTROL"
#> Joining, by = c("CONTROL", "TEST", "DIST_Q", "GROUP")
#> [1] "The 2th de-duping iteration complete."

Results of random selection option:

knitr::kable(TEST_CONTROL_LIST)

CONTROL	TEST	DIST_Q	GROUP
Texas	Illinois	0.2785090	1
New Mexico	Michigan	0.2580449	2
Maryland	New York	0.3071088	3
South Carolina	North Carolina	0.0998379	4
Iowa	North Dakota	0.0891413	5
Indiana	Ohio	0.0419648	6
New Jersey	Pennsylvania	0.1273365	7
Alabama	Tennessee	0.0657239	8
Idaho	Utah	0.1403257	9
Virginia	West Virginia	0.2253755	10

Providing a list of Test Groups/Individuals (No randomization of the test group)

TEST_GRP <- tribble(~'TEST','Colorado','Minnesota','Florida','South Carolina')

Example of data frame for the “test_list” input parameter:

knitr::kable(TEST_GRP)

TEST
Colorado
Minnesota
Florida
South Carolina

set.seed(99)
TEST_CONTROL_LIST <- TestContR::match_mixed(df, test_list = TEST_GRP)

Results for the “test_list” input parameter:

knitr::kable(TEST_CONTROL_LIST)

CONTROL	TEST	DIST_Q	GROUP
Arizona	Colorado	0.1106264	1
Maryland	Florida	0.1386266	2
Nebraska	Minnesota	0.0616531	3
North Carolina	South Carolina	0.0998379	4
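
The DIST_Q values for mixed data are consistent with Gower distance: range-normalized absolute differences for numeric columns, a 0/1 mismatch for categorical columns, averaged over all columns. This is inferred from the printed output rather than from the package documentation, and gower_pair() below is a hypothetical helper written in base R for illustration.

```r
# Assumption: DIST_Q for mixed data behaves like Gower distance.
gower_pair <- function(data, a, b) {
  parts <- vapply(data, function(col) {
    if (is.numeric(col)) {
      abs(col[a] - col[b]) / diff(range(col))  # range-normalized difference
    } else {
      as.numeric(col[a] != col[b])             # 0 if same category, else 1
    }
  }, numeric(1))
  mean(parts)
}

mixed <- data.frame(datasets::USArrests, division = datasets::state.division)
states <- rownames(datasets::USArrests)
a <- match("Arizona", states)
b <- match("Colorado", states)

# Compare with the DIST_Q reported for GROUP 1 in the table above.
arizona_colorado <- gower_pair(mixed, a, b)
print(round(arizona_colorado, 4))
```

Because Arizona and Colorado share a division, the categorical term contributes zero, which is part of why this pair ranks so closely.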

Top N matches for individuals or groups, the topn_* functions

NOTE: You can provide more than one group or individual to the topn_* functions, but they do not remove duplicates from the control lists, so the same control can appear for several test labels. WARNING: Because of this, if you give a topn_* function a full dataset of M labels, no “test_list”, and a Top N size close to M (the topN argument in the examples below), you will get an output on the order of M x M matches. For the topn_mixed function this may take a very long time to complete. In other words, be selective about the number of Top N matches you request; using the “test_list” parameter whenever possible is highly advised. More on this below.
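
To make the size warning concrete, the arithmetic can be sketched as follows. expected_rows() is a hypothetical helper, and it assumes a label is never matched to itself, which agrees with the example output but is not confirmed by the package documentation.

```r
# Hypothetical helper: estimated number of output rows when every label
# is used as a test label (no test_list supplied).
expected_rows <- function(n_labels, top_n) {
  n_labels * min(top_n, n_labels - 1)  # at most n_labels - 1 other labels
}

rows_small <- expected_rows(50, 10)  # 50 states, top 10 each: 500 rows
rows_large <- expected_rows(50, 49)  # 50 states, all others: 2450 rows
print(c(rows_small, rows_large))
```

The output grows with the product of the two sizes, so a large dataset combined with a large Top N gets expensive quickly.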

topn_numeric(): Select Top N Controls for a set of groups or individuals

Build/provide a list of the obs of interest in the test_list:

test_list <- tribble(~"TEST","Colorado")

Numeric only dataframe:

df <- datasets::USArrests %>%
  dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

state	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7
Connecticut	3.3	110	77	11.1
Delaware	5.9	238	72	15.8
Florida	15.4	335	80	31.9
Georgia	17.4	211	60	25.8

Build the list of Top N matches: Provide the test_list dataframe to the test_list parameter in the function as below.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10, test_list = test_list)

Results of Top N selection option:

knitr::kable(head(TOPN_CONTROL_LIST,20))

CONTROL	TEST	DIST_Q	DIST_RANK
Michigan	Colorado	1.236311	1
California	Colorado	1.287618	2
Missouri	Colorado	1.312741	3
Arizona	Colorado	1.365031	4
Nevada	Colorado	1.398859	5
Oregon	Colorado	1.533198	6
New Mexico	Colorado	1.546744	7
New York	Colorado	1.736339	8
Washington	Colorado	1.789792	9
Illinois	Colorado	1.789832	10
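
For intuition, a similar ranking can be sketched in base R under the assumption that DIST_Q is Euclidean distance on standardized columns. This is an approximation inferred from the output above, not the package's actual implementation.

```r
# Assumption: rank candidates by Euclidean distance on z-scored columns.
scaled <- scale(datasets::USArrests)
dist_matrix <- as.matrix(stats::dist(scaled))

colorado <- dist_matrix["Colorado", ]
colorado <- colorado[names(colorado) != "Colorado"]  # drop the self-distance

# Ten smallest distances, nearest first.
top10 <- head(sort(colorado), 10)
print(round(top10, 4))
```

If the assumption holds, the first entry should agree with the DIST_RANK 1 row in the table above.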

Top N without a Test List: Don’t be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_numeric(df, topN = 10)
#> Warning in TestContR::topn_numeric(df, topN = 10): If no dataframe provided for the "test_list" parameter, will use all the labels in the dataset.  Otherwise, please provide a dataframe for the "test_list" parameter with 1, or N, Test group(s) or individual(s) label(s) in a column named "TEST."
#> 
#>       See documentation for topn_numeric's test_list parameter

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))

CONTROL	TEST	DIST_Q	DIST_RANK
Louisiana	Alabama	0.7722224	1
Tennessee	Alabama	0.8407489	2
South Carolina	Alabama	0.9157968	3
Georgia	Alabama	1.1314351	4
Mississippi	Alabama	1.2831907	5
Maryland	Alabama	1.2896460	6
Arkansas	Alabama	1.2898102	7
Virginia	Alabama	1.4859733	8
New Mexico	Alabama	1.5993970	9
North Carolina	Alabama	1.6043662	10
New Mexico	Alaska	2.0580889	1
Michigan	Alaska	2.1154937	2
Maryland	Alaska	2.2777590	3
Colorado	Alaska	2.3265187	4
Tennessee	Alaska	2.3362541	5
Nevada	Alaska	2.3443182	6
Missouri	Alaska	2.5360573	7
South Carolina	Alaska	2.5640542	8
Oregon	Alaska	2.6990696	9
Arizona	Alaska	2.7006429	10

topn_mixed(): Select Top N Controls for a set of groups or individuals with mixed metrics/variables, meaning both numeric and categorical.

Numeric and categorical dataframe:

df <- datasets::USArrests %>% dplyr::mutate(state = base::row.names(datasets::USArrests)) %>%
  base::cbind(datasets::state.division) %>%
  dplyr::select(state, dplyr::everything())

Expected data set format with individuals/labels/names/id in the first column:

knitr::kable(head(df, n = 10))

state	Murder	Assault	UrbanPop	Rape	datasets::state.division
Alabama	13.2	236	58	21.2	East South Central
Alaska	10.0	263	48	44.5	Pacific
Arizona	8.1	294	80	31.0	Mountain
Arkansas	8.8	190	50	19.5	West South Central
California	9.0	276	91	40.6	Pacific
Colorado	7.9	204	78	38.7	Mountain
Connecticut	3.3	110	77	11.1	New England
Delaware	5.9	238	72	15.8	South Atlantic
Florida	15.4	335	80	31.9	South Atlantic
Georgia	17.4	211	60	25.8	South Atlantic

Build the list of Top N matches from mixed metrics:

set.seed(99)
TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10, test_list = test_list)

Results of Top N selection option:

knitr::kable(head(TOPN_CONTROL_LIST,20))

CONTROL	TEST	DIST_Q	DIST_RANK
Arizona	Colorado	0.1106264	1
Nevada	Colorado	0.1325795	2
New Mexico	Colorado	0.1588753	3
Utah	Colorado	0.2025942	4
Wyoming	Colorado	0.2231019	5
Montana	Colorado	0.2879513	6
Missouri	Colorado	0.3124434	7
California	Colorado	0.3164550	8
Michigan	Colorado	0.3176979	9
Idaho	Colorado	0.3293606	10

Top N Mixed without a Test List: Don’t be concerned about the warning; I just wanted to let users know that it would use all the labels in the dataframe.

TOPN_CONTROL_LIST <- TestContR::topn_mixed(df, topN = 10)
#> Warning in TestContR::topn_mixed(df, topN = 10): If no dataframe provided for the "test_list" parameter, will use all the labels in the dataset.  Otherwise, please provide a dataframe for the "test_list" parameter with 1, or N, Test group(s) or individual(s) label(s) in a column named "TEST."
#> 
#>       See documentation for topn_numeric's test_list parameter

Results of Top N selection without Test List:

knitr::kable(head(TOPN_CONTROL_LIST,20))

CONTROL	TEST	DIST_Q	DIST_RANK
Tennessee	Alabama	0.0657239	1
Mississippi	Alabama	0.1193394	2
Kentucky	Alabama	0.1748170	3
Louisiana	Alabama	0.2676967	4
South Carolina	Alabama	0.2845265	5
Georgia	Alabama	0.2982780	6
Arkansas	Alabama	0.3204231	7
Texas	Alabama	0.3267952	8
Virginia	Alabama	0.3309542	9
Maryland	Alabama	0.3313442	10
California	Alaska	0.1868701	1
Oregon	Alaska	0.2756384	2
Washington	Alaska	0.3324305	3
Nevada	Alaska	0.3536566	4
Michigan	Alaska	0.3674951	5
New Mexico	Alaska	0.3705949	6
South Carolina	Alaska	0.3776660	7
Maryland	Alaska	0.3917168	8
Colorado	Alaska	0.3973812	9
Arkansas	Alaska	0.4004365	10

Conclusion

Depending on your experiment, it may be prudent to add categorical metrics/variables that help align your data better. In the above examples, when only the numeric data is used, Alabama’s nearest match is Louisiana, but once the state division is taken into consideration, Alabama’s nearest match is Tennessee. Now you have the tools to create a list of nearest matches for your data, whether it is numeric or mixed.

Alfredo G Marquez

2020-07-29
