Help to create a script that recognises a text sequence

Posted over 3 years ago by Magali Colin - Avizua

Solved

I have to do something that requires, I think, two scripting exercises.

The first part is to build a correspondence table.

For example: a file contains several thousand addresses (see the extract in the attached Excel file). I need to group the addresses by similar sequences in their wording.

The attached drawing explains what I need.

This will save time when building a mapping table, which can then be used in a join, for which I need another block (another script).

I will write another ticket (called Help to create a join block by "one text contains another text") for the join.

Thanks in advance for your help.

Best Regards

Magali

0 Votes


7 Comments


Paola Tomei posted 8 months ago Admin

Hi Magali,


It would be good to know whether you've had a chance to review the solution and whether it meets your requirement.


Thanks

Paola

0 Votes


Paola Tomei posted over 3 years ago Admin

You can use the ETL process to break each string into individual words, merge the words with the keyword list, then join the result back to the original data.

A nice feature of this approach is that if the text contains multiple keywords, all of them are captured and listed, comma-separated, in the aggregated field.

See the IOZ attached.
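For readers without the IOZ, the steps described above (split into words, match against a keyword list, aggregate, join back) can be sketched in plain Python. This is only an illustration of the idea, not the contents of the attached process; the function names, the sample addresses and the keyword list are all hypothetical.

```python
import re


def tokenize(text):
    """Split a string into lowercase alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def tag_addresses(addresses, keywords):
    """Return one (address, matched_keywords) pair per input address.

    Every address is kept; addresses without any keyword get an empty string.
    Multiple matches are aggregated into one comma-separated field.
    """
    keyword_set = {k.lower() for k in keywords}
    result = []
    for address in addresses:
        matched = []
        for word in tokenize(address):
            # Keep each keyword once, in order of first appearance.
            if word in keyword_set and word not in matched:
                matched.append(word)
        result.append((address, ", ".join(matched)))
    return result


# Hypothetical sample data, for illustration only.
addresses = [
    "12 Rue de la Gare, Bruxelles",
    "Stationsstraat 5, Brussel",
    "Avenue Louise 100",
]
keywords = ["gare", "bruxelles", "louise"]

tagged = tag_addresses(addresses, keywords)
for address, found in tagged:
    print(f"{address!r} -> {found!r}")
```

Because the loop emits one row per original address, no address line is dropped when it has no keyword, which corresponds to keeping the original data on the preserving side of the join.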


0 Votes


Magali Colin - Avizua posted over 3 years ago

I tried a "left" join instead of a "right" one, but in the end it does not work either :-(

Any other ideas?

Magali

0 Votes


Magali Colin - Avizua posted over 3 years ago

Hi Antonio,

I understand why the result is not satisfying: it keeps only one address line, and I need to keep all the addresses.

I made a short example in the attached IOZ to explain:

image


Best Regards,


Magali

0 Votes


Magali Colin - Avizua posted over 3 years ago

Thanks Antonio.

However, the "fuzzy match" finds matches for 313 records out of 1609. Trying to isolate a sequence that repeats may work better when the two entries are not always in the same language.

Magali

0 Votes


Antonio Poggi posted over 3 years ago Admin

As discussed during the screen share, you can use the community Fuzzy Join block to merge datasets based on similar text. It performs a join between the first (left) and second (right) input. The field on which the join is performed must be text containing multiple terms.

The result contains records joined according to how many terms they share, weighted by inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
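To make the weighting idea concrete, here is a minimal Python sketch of a join scored by shared terms weighted by inverse document frequency. It is not the Fuzzy Join block's actual implementation; the smoothed formula `log(N / df) + 1` is one common IDF variant chosen here as an assumption, and the function names and sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphanumeric terms in a string."""
    return set(re.findall(r"\w+", text.lower()))


def idf_weights(token_sets):
    """Weight each term by log(N / df) + 1, so rarer terms count more."""
    n = len(token_sets)
    df = Counter()
    for tokens in token_sets:
        df.update(tokens)
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def fuzzy_join(left, right):
    """For each left text, pick the right text with the highest shared-term score.

    Every left record is kept; the match is None when no term is shared.
    """
    left_tokens = [tokenize(t) for t in left]
    right_tokens = [tokenize(t) for t in right]
    weights = idf_weights(left_tokens + right_tokens)
    matches = []
    for l_text, l_tok in zip(left, left_tokens):
        best, best_score = None, 0.0
        for r_text, r_tok in zip(right, right_tokens):
            # Score = sum of IDF weights of the terms both texts share.
            score = sum(weights[t] for t in l_tok & r_tok)
            if score > best_score:
                best, best_score = r_text, score
        matches.append((l_text, best, best_score))
    return matches


# Hypothetical sample data, for illustration only.
left = [
    "Rue de la Gare 12 Bruxelles",
    "Avenue Louise 100 Bruxelles",
    "Stationsstraat 5",
]
right = ["Gare Centrale", "Louise Tower", "Bruxelles Nord"]

matches = fuzzy_join(left, right)
for l_text, r_text, score in matches:
    print(f"{l_text!r} -> {r_text!r} (score {score:.2f})")
```

In this sample, "Bruxelles" appears in more texts than "Gare" or "Louise", so it carries less weight and the rarer shared term decides the match. A purely term-based score like this cannot match texts written in different languages unless they share at least one term.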


You can also use the fuzzy match in the Record filter, but if you have many rules, that could become tedious to set up and maintain.

1 Votes


Magali Colin - Avizua posted over 3 years ago

These are the files

0 Votes
