I have details relating to over 700,000 customer contacts. Our customers may use more than one address (e.g. home and business) and more than one telephone number (e.g. mobile or landline). It is also possible that there may be more than one name (e.g. mis-spelling) recorded for a particular person. I have identified 511,000 unique combinations of name / address / telephone number. Also, there are 4,000 unique names, 210 unique telephone numbers and 2,000 unique addresses.
My challenge is to identify all on the customer contacts that may be related to a single customer. I have a file which lists all 700K contacts, along with the unique identifier created for each combination of name / address / telephone number. We also know all of the names, addresses and telephone numbers associated with that contact.
In testing, and on small scale data, the following macro works well. However, it is proving impossible for the macro to manage the entire dataset as a single entity. Is there another, more efficient, way that this process could be written in SPSS?
Thank you in anticipation.
SET MITERATE 1000001. DEFINE !Repeat_Contact(). !Do !Repeat_Contact = 1 !TO 155649.
DATASET ACTIVATE CONTACT_DATA_FILE. DATASET COPY Nominal_DATA. DATASET ACTIVATE Nominal_DATA.
COUNT TARGET=UNIQUE_REF to END_1 !CONCAT ("('","URN_",!Repeat_Contact,"')"). IF (TARGET ge 1) Group_URN = !QUOTE (!CONCAT ("Group_URN_",!Repeat_Contact)). SELECT IF Group_URN = !QUOTE (!CONCAT ("Group_URN_",!Repeat_Contact)).
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=Group_URN /Num_Link_Contacts=N.
Select if (Num_Link_Contacts ge 5). Execute.
DATASET ACTIVATE Output_TEMPLATE. ADD FILES /FILE=* /FILE='Nominal_DATA'. EXECUTE.
DATASET CLOSE Nominal_DATA. !DOEND. !ENDDEFINE.
RUN MACRO!. *SET MPRINT ON. !Repeat_Contact.
Answer by jkpeck (6143) | Sep 20, 2016 at 09:14 AM
I do agree that this macro is inefficient. For starters, it is copying the whole input dataset a huge number of times. Answers to a few questions might help to craft a more efficient way.
What constitutes a match? What is the framework? Would you start with one basic dataset of customers that has all these variables and then look for matches in the contact database with some number of the attributes? One approach might be to convert the names to a canonical representation such as NIISYS and then look for matches? or use some string similarity measure?
What is the penalty for error, particularly if you miss a contact that should belong to a known customer. If this is for statistical purposes, perhaps some errors are accceptable.
use Syntax/Macro to automate generation of file names 4 Answers
HELP: automate file selection 17 Answers
Output Export Problem 3 Answers
How to use custom dialog variable list in CTABLES /TABLE subcommand 2 Answers
Conditional poisson regression 1 Answer