3

I've trying to join 3 dataframes, however I'm having problems to do it. Let me show the scenario.

I have 3 dataframes:

First: Country

just two columns: Country_ID, Country_Name Primary Key: Country_ID

Country_ID | Country_Name

BR     |   Brazil

Second: Sports

just three columns: Country ID, Sport_ID, Sport_Name Primary Key: Country_ID, Sport_ID

Country_ID | Sport_ID| Sport_Name

BR         |    1234     |     Football

Third: University

just four columns: Country_ID, University_ID, University_Name Primary Key: Country_ID, University_ID

Country_ID | University_ID| University_Name

BR         |    UFCABC        | Federal University of ABC

Final Result: just these columns: Country_Name, Sport_Name, University_Name

Country_Name | Sport_Name | University_Name

Brazil | Football | Federal university of Brazil

I tried to join ContryXSport and after that with university DataFrame, however I wasn't able to do it.

Here is the code for creation and join the dataframes:

country_raw_data = {
            'country_id': [country.id for country in countries],
            'country_name': [country.name for country in countries]                                  }

sport_raw_data = {
            'country_id': [sport.country.id for sport in sports],                
            'sport_id': [sport.id for sport in sports],
            'sport_name': [sport.name for sport in sports]
        }

university_raw_data = {
            'country_id': [university.country.id for university in universities],                
            'university_id': [university.state.id for university in universities],
            'university_name': [university.name for university in universities]
        }

Now, the dataframe instances:

I tried to create df like this:

country_df = pd.DataFrame(country_raw_data, columns: ['country_id', 'country_name'])

I don't know why, the country_df was create wrong, some columns didn't appear correct with the right values. Then I had to create like this, it is working.

country_df = pd.DataFrame(country_raw_data) 
sport_df = pd.DataFrame(sport_raw_data) 
university_df = pd.DataFrame(university_raw_data) 

Here, is the joins declarations:

I tried to do like this, however the result didn't join correct. Some columns didn't include the data frame correctly.

country_state_df = pd.merge(country_df, state_df, on='country_id', how='inner')

Another code tht I did it, however I had the same problem before:

country_sport_df = pd.merge(country_df, sport_df, 
                        left_on='country_id', 
                        right_on='sport_id', 
                        how='inner')

So, after the first join, I did the next join between country_state with city

country_sport_university.df = pd.merge(country_sport_df, university_df, 
                             on=['country_id', 'country_id'], 
                             how='inner')

I'd like the final result must be like these columns:

country_name | Sport_Name | University_Name

Brazil | Football | Federal University of ABC

it is possible to do it using dataframe, or I need to use another libraries?

So, there are a lot of data, around millions of data, for example.

Can anyone help my or give me a suggestion to solve the problem?

Thank you very much!

Kadu
  • 161
  • 1
  • 13

1 Answers1

2

You should be able to:

country_sport_df = country_df.merge(sport_df, on='country_id', how='inner')
country_university_df = university_df.merge(sport_df, on='country_id', how='inner').drop(['country_id', 'sport_id', 'university_id'], axis=1)

I assume that it's on purpose that country_id is the only link between sport_id and university_id.

Stefan
  • 35,233
  • 11
  • 66
  • 76
  • I didn't realize you could chain drop. Would this prevent the suffixes for common columns? – trench Jan 18 '16 at 22:52
  • Stefan, thank you for helping me. It's working now. I found a good documentation about this: [link](http://pandas.pydata.org/pandas-docs/stable/merging.html) – Kadu Jan 18 '16 at 22:58