I have a big file that contains string data: postal addresses. Example address: "1780 wemmel rue hendrik de mol 59/7"

I need to run a PCA on this data so that I can identify, on the individuals graph, the clusters that represent the physical delivery points (buildings, companies, ...). To do that, I need to extract relevant information (numeric or not) from the strings and use it as my attributes; then I can analyze it with PCA.

I started by creating 36 attributes (A-Z and 0-9) that count the occurrences of each letter and digit. But the PCA doesn't give a good result yet, so I need to extract more attributes that characterize the data.

What else could I extract from the data to get a good representation of the clusters on the individuals graph? I'm using R.

Thank you.

  • It's not clear to me what you want to achieve with PCA. Can you give a more detailed description of your task? – cyberj0g Jun 17 '15 at 09:58
  • My data was collected from mail letters. I want to use PCA to explore the individuals graph and identify any clusters that might represent the physical delivery points, i.e. the buildings. Each cluster would contain the different addresses that refer to that building. I chose PCA because I don't know of another method to do that. If you can think of something else, please tell me about it. – Taoufiq Mouhcine Jun 17 '15 at 10:10
  • This might be of interest: http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont –  Jun 17 '15 at 10:12
  • Thanks for your answer. Actually, I don't have categories like (street, avenue, ...) in a variable. I only have address strings. Even when several of them are the address of the same person, the strings are not identical, because they were collected from letters. – Taoufiq Mouhcine Jun 17 '15 at 10:23

1 Answer

I don't think this is a task for PCA. I would first try to introduce some kind of distance measure between two addresses. You can either use the entire address as a single feature, in which case there are plenty of general-purpose string similarity measures, for example Levenshtein distance (the utils package has a function for it, adist()). Or you can introduce more features, like building number, postal code, etc., and use a combination of Euclidean distance and a text-similarity distance. Your 36 variables seem like too many for the task. Either way, your distance measure should give a small value for 'close' addresses and a large value for unrelated addresses in your domain.
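As a rough illustration of the whole-address option, here is a minimal sketch using adist() from utils. The sample addresses, the cleaning rules and the length normalisation are my own assumptions for the example, not something taken from your data.

```r
# Hypothetical sample: the first three strings are variants of one building.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59 bte 7",
  "1780 Wemmel, Rue Hendrik De Mol 59",
  "1000 bruxelles avenue louise 12"
)

# Normalise case, punctuation and spacing so trivial differences
# are not counted as edits (assumed cleaning rules, adapt as needed).
clean <- tolower(addresses)
clean <- gsub("[^a-z0-9 ]", " ", clean)
clean <- trimws(gsub(" +", " ", clean))

# adist() returns the matrix of pairwise Levenshtein distances.
d <- adist(clean)

# Optional: divide by the length of the longer string of each pair,
# so that every value lies between 0 and 1.
len <- nchar(clean)
d_norm <- d / outer(len, len, pmax)

print(d)
```

Variants of the same building should end up with small distances to each other and large distances to everything else; if they don't, the cleaning step is usually the first thing to adjust.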

After deciding on a distance measure and choosing features, apply k-means clustering with a custom distance function to your data. You can use the flexclust package for that. Nice suggestions for determining the number of clusters can be found here.
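If you want to try clustering directly on a string distance matrix before setting up flexclust, one simple alternative is hierarchical clustering from base R. Note that this is a swapped-in technique, not k-means: hclust() works on any distance matrix and does not need numeric centroids. The sample addresses and the cut height of 10 edits are assumptions chosen for this toy example.

```r
# Hypothetical sample: two buildings, written in several ways.
addresses <- c(
  "1780 wemmel rue hendrik de mol 59/7",
  "1780 wemmel rue hendrik de mol 59 bte 7",
  "1780 wemmel rue hendrik de mol 59",
  "1000 bruxelles avenue louise 12",
  "1000 bruxelles av louise 12"
)

# Pairwise Levenshtein distances between the lower-cased strings.
d <- adist(tolower(addresses))

# Average-linkage hierarchical clustering on the distance matrix.
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an assumed threshold of 10 edits: addresses that
# merge below that height fall into the same cluster.
clusters <- cutree(hc, h = 10)

# Show which addresses were grouped together.
print(split(addresses, clusters))
```

Cutting by height means you don't have to fix the number of clusters in advance, which helps when you don't know how many buildings there are. On a big file the full distance matrix grows quadratically, so you would probably want to group by postal code first and cluster within each group.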

With that you'll likely find your clusters. Good luck.

cyberj0g