Separates a data frame into sub-data frames based on types of row duplication or lack thereof. Also identifies NA values in id_cols.

Usage

find_dups(df, id_cols, source)

Arguments

df

a data frame with or without duplicated rows

id_cols

a character vector of the column names that form a unique ID

source

a string, name of the data source (usually "df1" or "df2")

Value

a named list with four elements:

  • exact_dups: rows that were duplicated for every value. There is one distinct row in exact_dups for each case, and the column n lists how many occurrences of that row were in df.

  • id_dups: rows that were duplicated based on id_cols but had other values that differed. The column n contains how many ID duplicates of that ID exist in df.

  • id_NA: rows that had an NA value in an id_cols column

  • no_dups: df with all id_dups rows removed and only one copy of each exact_dups row.
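
The four elements described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `sketch_find_dups` is a hypothetical helper written only to mirror the descriptions, and it makes several assumptions. It ignores the `source` argument (the page does not say how it is used), it counts `n` in `id_dups` as the number of distinct rows sharing an ID, it leaves rows with NA IDs in `no_dups`, and it builds comparison keys with `paste()`, so a literal `"NA"` string would be treated like a missing value.

```r
# Hypothetical illustration only; not the package's find_dups().
sketch_find_dups <- function(df, id_cols) {
  # Build one string key per row: all columns, and ID columns only.
  row_key <- do.call(paste, c(unname(as.list(df)), sep = "\r"))

  # Count how many times each full row occurs in df.
  n_row <- ave(seq_len(nrow(df)), row_key, FUN = length)
  first <- !duplicated(df)

  # One copy of each fully duplicated row, with its occurrence count.
  keep_exact <- first & n_row > 1
  exact_dups <- cbind(df[keep_exact, , drop = FALSE], n = n_row[keep_exact])

  # After collapsing exact duplicates, any ID that still appears more
  # than once must differ in at least one non-ID column.
  distinct_rows <- df[first, , drop = FALSE]
  id_key <- do.call(paste, c(unname(as.list(distinct_rows[id_cols])), sep = "\r"))
  n_id <- ave(seq_len(nrow(distinct_rows)), id_key, FUN = length)

  id_dups <- cbind(distinct_rows[n_id > 1, , drop = FALSE], n = n_id[n_id > 1])
  no_dups <- distinct_rows[n_id == 1, , drop = FALSE]

  # Rows with a missing value in any ID column.
  id_NA <- df[!complete.cases(df[id_cols]), , drop = FALSE]

  list(exact_dups = exact_dups, id_dups = id_dups,
       id_NA = id_NA, no_dups = no_dups)
}

# Toy data: ID 1 is an exact duplicate, ID 2 has conflicting values,
# ID 3 is unique, and the last row has a missing ID.
df <- data.frame(
  id    = c(1, 1, 2, 2, 3, NA),
  value = c("a", "a", "b", "c", "d", "e")
)

res <- sketch_find_dups(df, id_cols = "id")
names(res)
```

With this toy input, the sketch puts the repeated ID 1 row in `exact_dups` once with `n` equal to 2, both ID 2 rows in `id_dups`, and the row with the missing ID in `id_NA`. Whether the real function also drops NA-ID rows from `no_dups` is not stated on this page.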