{compareDFx} compares two data frames with detailed reporting natively in R or in MS Excel. The MS Excel report is a multi-tab workbook that color codes additions, deletions, and changes which is ideal for R users and non-R users alike and makes identifying data issues and version differences quick and easy.

Installation
You can install the development version of {compareDFx} like so:
# install.package("devtools")
install_github("skgithub14/compareDFx")Comparing two data frames using {compareDFx}
Let’s assume that df1 and df2 are two versions of the same data set, where df1 is the newer version. We want to know exactly which values changed. Using the two example data frames included with {compareDFx}, df1 and df2, we will walk through the {compareDFx} workflow.
Creating the report
We will start by loading the {compareDFx} package and defining our data frames:
library(compareDFx)
data(df1)
data(df2)View the example data
# example data, df1
knitr::kable(df1)| id1 | id2 | num | char | int | log | date | class_num_char | class_num_log | dec_diff_ref | dec_diff | extra1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 1 | a | 1 | NA | 2023-01-01 | 100 | NA | 1 | 1 | extra1 |
| 2 | B | 2 | NA | NA | TRUE | NA | 100 | NA | 1 | 1 | extra1 |
| 3 | C | NA | c | 2 | FALSE | 2023-01-03 | 100 | NA | 1 | 1 | extra1 |
| 4 | D | 4 | d | 3 | NA | 2023-01-04 | 100 | NA | 1 | 1 | extra1 |
| 5 | E | 5 | e | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
| 5 | E | 5 | g | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
| 5 | Z | 5 | e | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
| 5 | Z | 5 | e | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
| 5 | NA | 5 | g | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
| 6 | M | 5 | e | 5 | FALSE | 2023-01-05 | 100 | NA | 1 | 1 | extra1 |
# example data, df2
knitr::kable(df2)| id1 | id2 | num | char | int | log | date | class_num_char | class_num_log | dec_diff_ref | dec_diff | extra2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 1 | a | 1 | NA | 2023-01-01 | 100 | NA | 0e+00 | 1.000000 | extra2 |
| 1 | A | 1 | a | 1 | NA | 2023-01-01 | 100 | NA | 0e+00 | 1.000000 | extra2 |
| 1 | A | 1 | a | 1 | NA | 2023-01-01 | 100 | NA | 0e+00 | 1.000000 | extra2 |
| 2 | B | 2 | NA | NA | TRUE | NA | 100 | NA | -1e-04 | 1.000100 | extra2 |
| 3 | C | NA | c | 2 | FALSE | 2023-01-03 | 100 | NA | -1e-05 | 1.000010 | extra2 |
| 4 | D | 4 | d | 3 | NA | 2023-01-04 | 100 | NA | -1e-06 | 1.000001 | extra2 |
| 6 | F | 6 | f | 6 | TRUE | 2023-01-06 | 100 | NA | 0e+00 | 1.000000 | extra2 |
| 6 | M | 5 | e | 5 | FALSE | 2023-01-05 | 100 | NA | 1e+00 | 1.000000 | extra2 |
{compareDFx} uses a row’s unique ID to evaluate what changed between two data frames. Unique IDs can consist of one or multiple columns. For df1 and df2, the columns the collectively make a unique row ID are id1 and id2. This means that in df1 there are two columns named id1 and id2 and in df2 there are also two columns named id1 and id2.
id_cols <- c("id1", "id2")Now we can run the comparison and generate the report in Excel:
# run the comparison
comparison <- get_comparison(df1, df2, id_cols)
# create the excel report
create_comparison_excel(comparison, path = "comparison_report.xlsx")Now we have a multi-tab, color-coded MS Excel report with summary statistics. On the data tabs of the report, all cells from df1 will have a white background and all cells from df2 will have a grey background. The text is colored accordingly:
Green text: cell values in
df1that are different fromdf2Red text: cell values in
df2that are different fromdf1Black text: cell values that are the unchanged between
df1anddf2
You can explore the file yourself here:
Download comparison_report.xlsx
As you can see, there are six tabs in the Excel workbook:
- summary: summary statistics

- summary by column: column summary statistics

- data top-bottom: the combined data displayed so that each row of
df1is directly above itsdf2counterpart.

- data left-right: the same data as shown in data top-bottom, but where the
df1columns are immediately to the left of theirdf2counterpart.

df1: the raw data from
df1, for referencedf2: the raw data from
df2, for reference
If you aren’t interested in the Excel report but want to work with the report in R instead, you can use the output from comparison object directly (see the documentation for get_comparison()).
Understanding the report
If you look to the far left in the data tabs (data top-bottom and data left-right), you will see some extra columns that report useful comparison information:
-
discrepancy: the comparison status, which can take one or more of the following values:"matched": a row that exactly matched betweendf1anddf2"addition": a row that only existed indf1"deletion": a row that only existed indf2"changed": a row where theid_colsmatched between betweendf1anddf2but which had a least one other column with a change"exact duplicate": a row with multiple occurrences in eitherdf1ordf2. “Exact” means that every value in the row matched perfectly to its duplicate copies. Exact duplicates can also be marked as"matched","addition","deletion", or"changed"."ID duplicate (not exact duplicate)": a row that had multiple occurrences of the same combination ofid_colsin eitherdf1ordf2but for which there was not an exact duplicate (meaning at least one non-ID value in the row was different). An ID duplicate cannot be marked as"matched","addition","deletion", or"changed"because there will be multiple value matches when comparingdf1anddf2. The user should fix these errors and re-run the comparison."ID contains NA": a row in eitherdf1ordf2which had anNAvalue in one or moreid_cols. The comparison procedure requires non-NAvalues for allid_colstherefore these rows will not be marked as"matched","addition","deletion", or"changed". The user should fix these errors and re-run the comparison.
source: this column is only show in the data top-bottom sheet and tells you whether the data in the row came fromdf1ordf2
The next set of columns are hidden by default, but can be shown by clicking the “+” in the top left corner of the worksheet:
change group: ifdiscrepancycontains"changed"this column will be populated with an arbitrary number that links changed records betweendf1anddf2exact dup cnt: ifdiscrepancycontains"exact duplicate"this column gives the number of exact duplicates indf1ordf2ID dup cnt: ifdiscrepancycontains"ID duplicate (not exact duplicate)"this column gives the number of ID duplicates indf1ordf2
Using the report
Here is one way you can use the report that I find effective:
If you have any
"ID contains NA"values in thediscrepancycolumn, start by fixing those in your raw data, then re-run the report. If these only exist in the older version of your data (df2), then maybe you don’t need to fix these because you are planning on replacing the old version with the new version (df1).If you have any
"ID duplicate (not exact duplicate)"values in the discrepancy column, investigate why you have rows indf1and/ordf2that have the same ID values but different entries in the other columns. Similar to the step above, if these problems are only in your older version (df2), maybe you don’t need to do anything.Filter the
discrepancycolumn to find any entries containing"exact duplicate". Depending on your project, exact duplicates might be okay but, you might want to check there are the correct number of duplicates using theexact dup cntcolumn. If exact duplicates are not okay, remove those from your raw data then re-run the report.Filter the
discrepancycolumn to find any entries containing"addition"and"deletion". Check that the ID values are correct. If the ID values are incorrect or misspelled, it is possible that these rows are not true additions and deletions because their ID’s are supposed to match but do not, in which case these rows should actually be categorized as"matched"or"changed". If the ID values look good, confirm that the deleted rows are supposed to be deleted and the added rows are supposed to be added.Filter the
discrepancycolumn to find any entries containing"changed"and verify that all changes are expected. Fix any “bad” changes in the raw data and iteratively re-run the report. In some cases, filtering on thechange groupcolumn can be helpful.Filter the
discrepancycolumn to find any entries containing"matched"and verify that these rows were not supposed to change between versions. Fix the issues and re-run the report iteratively.
