I encountered the problem in my recent project where I was given 2 huge spreadsheets to find out the attributes in each entity. This was a data modeling supplemental exercise where I had to list down the entities where the attribute belong. The first thing which came to my mind was VLookup, but I was told that there are duplicate values. For example, Party ID attribute may reside in both Party entity and Customer entity and since Excel VLookup only returns the first value, it could not be applied.

I created a custom macro and automate it anyway, but during my journey on learning Python, I thought whether I can solve the problem using Python, NumPy and Pandas. Let me break it down to problem statement, data sample and solution.

Problem Statement

The data sample looks like the following. Notice that I have master data on the left and sales figures on right. Bhaskar has 2 sales and the normal VLookup function will not yield this. Hence I have to use a customized macro in excel or a Python code. My output should look like below.

Thought Process and Solution

This was clear on my mind that I need to merge the 2 sheets after I capture them in DataFrames. But merging them would give me 2 separate rows, and not a comma-separated value. So, the first part was easy till now –

df1 = pd.read_excel('vlookupEx.xlsx',sheet_name='Sheet1')
df2 = pd.read_excel('vlookupEx.xlsx',sheet_name='Sheet2')
df3 = pd.merge(df1, df2, how ='inner', on ='Name')
df3

The thought process now is how my brain perceives this information and arrange in comma-separated format. So, I see the first column ‘Name’ and try to find out whether there are other instances. If there are other instances, then I append it in the same row I have in the first instance with a comma.

In Python terms, I have to create a list of dictionaries. where the key will be ‘Name’ and Value will be ‘Sales’. I then have iterate the list to find whether I have the key. If I don’t have the key, then I insert it in a new dictionary. If I already have the key, then I append the value in the existing key, separated by a comma.

Dict:dict={} # Create a new dictionary
List:list=[] # Create a new list
for item in pdx.itertuples():
    List.append({item[1]:str(item[3])}) # For each item in the DataFrame, create a list of dictionaries

List # Call the list and see the output -
# [{'Bhaskar': '100'}, {'Bhaskar': '400'}, {'Sutapa': '200'}, {'Binoy': '426'}]

Now comes the tricky part. I create an empty dictionary Output and then take the first item of the list, which is a dictionary. Then I take the first item of the list and check for existence. If it’s not found then goes to the else block and I add it in the dictionary. If it is found then it goes in the if block and I append it with a comma to the value of the list. The below picture shows a step-by-step representation with print.

Final output
Step-by-step iteration result

So my programming work is done. I have managed to get the output I require. The next step would be putting this dictionary in a DataFrame and merging it back with my df3 DataFrame. Once that is done, I have to do some deduplication and formatting before I put it back as an excel output.

excel op

dfOp=pd.DataFrame(Output,index=[0])
dfOp=dfOp.T
dfOp=dfOp.reset_index()
dfOp=dfOp.rename(columns={'index': "Name"})
finalOutput=pd.merge(dfOp, df3, how ='inner', on ='Name')
finalOutput.drop_duplicates(subset=['Name'],inplace=True) # Inner join returns 2 rows for 'Bhaskar'. So removing duplicate
finalOutput.drop('Sales', axis = 1, inplace = True) # I finally don't need the sales as it is already consolidated
finalOutput.to_excel('vlookupOp.xlsx',sheet_name='Sheet1')

Conclusion

This problem kept me awake for couple of nights! At my level as Python programmer, it wasn’t easy unless my friend and colleague, the ‘Code Wiz’ Arghyadeep Chaudhury did not direct me to the right direction.

There are certain outliers to this solution. The code does not consider any strikethrough or color-coded columns in excel which are marked for omission. These can be captured and excluded in an excel function, but not in Python.

I think there must be some easy solution to it. If you have a better, elegant, alternate solution, please leave a message in the comment box below.

2 COMMENTS

  1. Why not use a simple loop to append unique values for each id on the 2nd table first. With now only 1 record per name, do a straight join with the first ?

    • Hi Sootanu,
      Thanks for the comment. In Pandas, importing an excel will provide its own index which can be used as unique values/surrogate keys. Even if I join them using the Name column, I would have ended up with 2 matching rows for the name ‘Bhaskar’ as I have demonstrated in df3. The tough part is to transpose them as comma-separated output. I thought of using ‘Group By’, but a group by will require an aggregate function like sum(), count(), mean(), etc. Hence this is what I came up with. I think Map() can be used here, but I am not sure enough to yield the desired output.

LEAVE A REPLY

Please enter your comment!
Please enter your name here