Skip to content

adamdempsey90/StarClusters

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Finding Star Clusters in Gaia data with Pandas and Scikit-Learn

Introduction

Young star clusters and stellar associations are important sites for understanding the stellar birth environment and stellar evolution. The Gaia mission will provide the 3D location and proper motions for over 1 billion stars, making it the perfect telescope to discover new star clusters and further characterize known clusters.

This activity uses a piece of the initial Gaia data release as input for a machine learning clustering algorithm to identify stellar overdensities.

For this notebook, we're using NumPy, Matplotlib, Pandas, Scikit-learn, and Astropy. If you don't have these you can install them with pip. On the commmand line type:

pip install --upgrade numpy
pip install --upgrade astropy
pip install --upgrade scikit-learn
pip install --upgrade pandas
pip install --upgrade matplotlib

The Data

This dataset combines Hipparcos and Tycho-2 data with new Gaia observations to provide accurate 3D positions and proper motions on the sky -- no radial velocities yet, though!

All of the data can be downloaded in chunks from the Gaia website, here. A description of all of the columns can be found here.

I've taken the relevant columns and put everything into a Pandas DataFrame. This is compactly stored in the HDF5 data file, alldata.hdf, which you should download from here and place in this directory (note that the file is 285 MB).

Load the data by reading it in with Pandas,

import pandas as pd
dat  = pd.read_hdf('alldata.hdf')

You can think of this simple Pandas DataFrame as a large table which has built in functions to process rows and columns, read and write to many different formats on disk, and interact with other DataFrames.

Try printing the DataFrame to see a small sample.

print(dat)
                 source_id  ref_epoch          ra  ra_error       dec  \
0            7627862074752     2015.0   45.034330  0.305989  0.235392   
1            9277129363072     2015.0   45.165007  2.583882  0.200068   
2           13297218905216     2015.0   45.086155  0.213836  0.248825   
3           13469017597184     2015.0   45.066542  0.276039  0.248211   
4           15736760328576     2015.0   45.136038  0.170697  0.335044   
5           16527034310784     2015.0   45.141378  0.205008  0.359618   
6           16733192740608     2015.0   45.152959  0.189524  0.386343   
7           16870631694208     2015.0   45.112779  0.206981  0.380844   
8           26834955821312     2015.0   45.010270  0.243576  0.351099   
9           33260226885120     2015.0   44.974659  0.267911  0.473472   
10          44358422235136     2015.0   45.501447  0.134010  0.497704   
11          51745765982848     2015.0   45.467165  0.146028  0.655080   
12          82463372227072     2015.0   44.866247  0.160620  0.561507   
13          83150566993664     2015.0   44.937343  0.208963  0.623838   
14          86105504493056     2015.0   44.797149  0.794706  0.630677   
15          92599494964480     2015.0   44.414969  0.205710  0.567882   
16         111600430360448     2015.0   44.918310  0.191469  0.755087   
17         115723598973952     2015.0   45.105609  0.168169  0.812344   
18         120190364960512     2015.0   45.008011  0.170795  0.888580   
19         122732985598464     2015.0   45.196000  0.213285  0.993305   
20         132662949987200     2015.0   45.072787  0.213813  0.969997   
21         134999412195456     2015.0   45.118602  0.324832  1.046581   
22         139363098967552     2015.0   45.025729  0.219474  1.110209   
23         142730353190144     2015.0   45.667584  0.156331  0.712084   
24         148949465832832     2015.0   45.656687  0.250327  0.848682   
25         180045029052928     2015.0   46.123388  0.151871  1.066354   
26         189871914442624     2015.0   46.191486  0.159092  1.190304   
27         202619377158912     2015.0   46.044931  0.212695  1.230477   
28         219558728443008     2015.0   45.356442  0.365251  1.156174   
29         223407019139328     2015.0   45.580692  0.376271  1.194216   
...                    ...        ...         ...       ...       ...   
34045  6917296790170220032     2015.0  314.765232  0.363478 -1.194010   
34046  6917297683523416064     2015.0  314.735773  0.440119 -1.171667   
34047  6917321597900574720     2015.0  313.972540  0.156991 -1.338388   
34048  6917329603720362112     2015.0  314.274696  0.251578 -1.198183   
34049  6917333933046642944     2015.0  314.155273  0.348785 -1.144750   
34050  6917335410515389696     2015.0  314.141918  0.209730 -1.063458   
34051  6917339842921641984     2015.0  313.842549  0.206049 -1.140866   
34052  6917342557341568384     2015.0  313.778438  0.450388 -1.134689   
34053  6917350322641840128     2015.0  313.814036  0.222999 -1.020702   
34054  6917355270444915200     2015.0  314.297637  0.383452 -1.046371   
34055  6917371831838057728     2015.0  314.078916  0.381021 -0.989514   
34056  6917375955006659200     2015.0  314.068253  0.445682 -0.900887   
34057  6917383239271187712     2015.0  314.351767  0.646789 -0.726464   
34058  6917389252225936128     2015.0  315.046298  1.192695 -1.112642   
34059  6917402171487559808     2015.0  315.149614  0.284295 -0.899310   
34060  6917406638253525760     2015.0  314.878231  0.225291 -0.966348   
34061  6917412960445383040     2015.0  314.818544  0.286806 -0.839533   
34062  6917413510201195904     2015.0  314.858023  0.183427 -0.804869   
34063  6917420519587820928     2015.0  314.836804  0.224744 -0.754084   
34064  6917428319248453760     2015.0  315.270266  0.312305 -0.768764   
34065  6917452439784507648     2015.0  315.410962  0.665954 -0.478297   
34066  6917474670535481728     2015.0  314.853264  0.202654 -0.539717   
34067  6917476319802433536     2015.0  314.459750  0.968369 -0.724208   
34068  6917487486717891328     2015.0  314.766314  0.301253 -0.473161   
34069  6917487967754227584     2015.0  314.812092  0.305573 -0.439632   
34070  6917488998546378368     2015.0  314.741700  0.277332 -0.423197   
34071  6917493705830041600     2015.0  314.643818  0.359097 -0.324975   
34072  6917504975824469248     2015.0  315.282880  0.472265 -0.343177   
34073  6917517998165066624     2015.0  314.740648  0.192053 -0.228114   
34074  6917521537218608640     2015.0  314.960731  0.491821 -0.221301   

       dec_error   parallax  parallax_error        pmra  pmra_error  \
0       0.218802   6.352951        0.307910   43.752313    0.070542   
1       1.197789   3.900329        0.323488   10.036263    4.611414   
2       0.180326   3.155313        0.273484    2.932284    1.908644   
3       0.200958   2.292367        0.280972    3.661982    2.065052   
4       0.170130   1.582077        0.261539    0.340802    1.220476   
5       0.179848   8.663080        0.255867  -52.849281    1.264286   
6       0.171880   5.698205        0.263677   24.527237    1.177377   
7       0.150943   2.090812        0.222206   -1.572928    1.733194   
8       0.169345   6.202492        0.247253   26.308009    2.034845   
9       0.163531   1.677673        0.222067    9.104210    2.209395   
10      0.395606   9.957504        0.547771   25.103704    0.081780   
11      0.151794   5.032630        0.230189   97.761970    1.199557   
12      0.151276   2.458658        0.218785   -6.289263    1.458177   
13      0.157091  15.728640        0.229521  -98.248124    0.181544   
14      0.485048   4.695800        0.956438    9.533592    1.084953   
15      0.168425  11.253095        0.267342  123.868785    1.319942   
16      0.181781   2.785788        0.268922   29.299627    1.470242   
17      0.157115  10.549845        0.231620    7.810436    1.564040   
18      0.164587   4.198517        0.250279  -59.555191    1.192541   
19      0.206712   6.581782        0.303335   16.400616    0.801764   
20      0.165124   5.746190        0.251024   29.368910    0.084182   
21      0.235454   1.186701        0.301492    1.118005    2.366421   
22      0.215515   2.081689        0.313718  -14.473087    1.497436   
23      0.155345   5.400538        0.245903   80.425517    1.341393   
24      0.211829   1.512226        0.332788   -5.676397    0.124096   
25      0.532063   2.986565        0.776729   -5.716453    1.057208   
26      0.571938   0.656160        0.845735    5.476655    1.390180   
27      0.629442   2.872148        0.906019    2.199605    1.413306   
28      0.351513   4.740228        0.522721   17.693950    0.050109   
29      0.157121   0.846912        0.405900    5.308157    7.278217   
...          ...        ...             ...         ...         ...   
34045   0.212978   3.050534        0.741126   18.620553    2.685246   
34046   0.238175   1.496255        0.888100   27.513501    3.228359   
34047   0.123209   2.869659        0.281346   18.935067    0.932710   
34048   0.608068   2.031372        0.599195  -15.402302    0.948013   
34049   0.337395   8.599250        0.526418  -22.592537    1.497488   
34050   0.201801   3.199663        0.228593   -4.313455    0.076549   
34051   0.160198   3.089541        0.411665   -0.877548    1.423813   
34052   0.213632  -0.052200        0.944066   15.213932    3.458697   
34053   0.139983   1.414788        0.459721  -14.390107    1.650502   
34054   0.173235   1.441584        0.777750   -2.023907    2.858043   
34055   0.224392   3.054000        0.724973   10.703563    2.580172   
34056   0.437163   5.024796        0.675503    2.588341    1.231998   
34057   0.404334   8.092877        0.477413   41.053654    1.120982   
34058   0.770643   0.888197        0.932313    3.450822    2.165659   
34059   0.149646   3.290275        0.563243   -7.657043    2.042954   
34060   0.160218   1.611850        0.458997    2.983173    1.639595   
34061   0.257089   0.445328        0.427754    0.736633    1.301267   
34062   0.179483   3.078234        0.335764   -2.737049    1.098831   
34063   0.204943   3.427077        0.399098    2.650628    1.313554   
34064   0.190012   3.169772        0.302901    0.088162    0.075149   
34065   0.389449  23.499923        0.516988  159.970016    0.147853   
34066   0.176745   6.082151        0.276084   10.011422    0.756730   
34067   0.589946  13.715043        0.691190    4.837336    1.766099   
34068   0.167501   7.962663        0.588791   21.825678    2.106624   
34069   0.295601   0.785996        0.446457   -5.483699    0.893791   
34070   0.249741   5.074307        0.284089   21.970773    0.078469   
34071   0.148234   1.698348        0.741014   -1.283561    2.741681   
34072   0.209030   6.036938        0.396880   15.713556    0.941184   
34073   0.131650   1.484142        0.348601   11.352889    1.219847   
34074   0.268675   2.680111        0.450774    2.897879    1.035482   

            pmdec  pmdec_error  phot_g_mean_mag           l          b  \
0       -7.641990     0.087402         7.991378  176.740413 -48.714422   
1      -55.109173     2.522929        10.580959  176.916420 -48.645004   
2       -1.602867     1.035259        10.743102  176.780400 -48.667845   
3      -18.414912     1.129851        11.075682  176.760412 -48.682365   
4       -2.379387     0.710632        10.168701  176.739184 -48.572035   
5      -72.711396     0.720852         9.971989  176.718131 -48.551099   
6       -3.350355     0.707184         9.873996  176.701325 -48.524183   
7      -11.661616     0.982994        10.561042  176.665032 -48.556851   
8        9.195405     1.028325         9.970699  176.589274 -48.651100   
9      -24.526435     1.131898         9.790358  176.419105 -48.591185   
10     -23.671980     0.077636         9.317437  176.945298 -48.196220   
11     -20.939852     0.623911        11.259156  176.739930 -48.111535   
12     -12.475993     0.810281        10.555697  176.209299 -48.607023   
13    -160.800170     0.167036         9.482735  176.217377 -48.512626   
14     -50.587002     2.401721        10.276479  176.061442 -48.607747   
15      66.745085     0.738076         9.854890  175.721092 -48.923440   
16     -29.862607     0.844261        11.263437  176.056139 -48.434129   
17     -82.990462     0.908631        10.787842  176.192429 -48.260497   
18     -64.632961     0.649234         9.215294  176.007954 -48.276522   
19     -24.100299     0.538782         8.774116  176.094063 -48.069158   
20     -93.858217     0.082084         7.385724  175.989318 -48.173229   
21       4.687236     1.275574        11.914900  175.955948 -48.086815   
22      -0.216158     0.832325        11.533558  175.790413 -48.107992   
23      12.644391     0.736791        11.319064  176.886896 -47.927829   
24       0.735340     0.100249         9.713120  176.729163 -47.840718   
25       4.547636     0.595603        10.231507  176.977602 -47.353826   
26     -28.331014     0.779932        11.557126  176.915611 -47.218804   
27     -22.447649     0.825536        11.091142  176.723034 -47.296261   
28      -1.378007     0.048145         7.342878  176.088911 -47.840582   
29      -5.454302     2.371843        11.405336  176.282250 -47.653997   
...           ...          ...              ...         ...        ...   
34045   -7.079728     1.432224        10.561375   47.615439 -28.670988   
34046   -6.010818     1.717241        11.167043   47.620299 -28.634266   
34047    3.312674     0.601500         9.536238   47.017862 -28.060733   
34048  -25.618008     1.230587         9.024313   47.328258 -28.250516   
34049  -57.688259     1.209436         9.999796   47.311921 -28.120502   
34050    1.020499     0.049310         7.649970   47.383697 -28.067801   
34051   -7.093075     0.827746         9.743663   47.136610 -27.848831   
34052   -7.058379     1.799592        10.925318   47.106022 -27.790394   
34053  -40.097915     0.896231         9.885277   47.237516 -27.763475   
34054   -8.061882     1.484723        10.765187   47.489957 -28.193303   
34055  -14.718697     1.455904        11.237844   47.419708 -27.976012   
34056   10.973142     1.711955         9.977519   47.500032 -27.921847   
34057  -54.221917     1.510662         9.524471   47.833480 -28.077103   
34058    5.184958     2.950987         9.767707   47.858589 -28.871391   
34059   -9.955667     1.103737        10.744613   48.128243 -28.851157   
34060   -5.810293     0.896728         9.837787   47.904294 -28.652225   
34061  -13.248180     0.999058         8.940431   47.993742 -28.536141   
34062  -23.098967     0.711390         9.683264   48.050631 -28.552341   
34063  -13.938733     0.864367         9.699898   48.087940 -28.508127   
34064   12.743394     0.043069         8.065196   48.326862 -28.887799   
34065 -133.586663     0.087725        10.029232   48.693943 -28.858900   
34066  -18.307018     0.678115         9.000583   48.306903 -28.412379   
34067  -61.993429     2.238018         9.838596   47.898156 -28.168809   
34068  -40.087701     1.181743        10.418828   48.321079 -28.303596   
34069   11.000701     1.127911         8.263434   48.380466 -28.325648   
34070   -4.440524     0.047433         8.988852   48.355406 -28.256816   
34071  -10.401225     1.401695        10.324625   48.393942 -28.122375   
34072  -27.852345     1.277858         9.238852   48.750523 -28.679451   
34073    1.847108     0.730717         9.017069   48.544630 -28.155566   
34074    3.151734     1.438891         9.732571   48.680054 -28.340635   

          ecl_lon    ecl_lat  
0       42.641825 -16.121052  
1       42.761180 -16.193033  
2       42.697502 -16.123363  
3       42.677791 -16.118216  
4       42.773370 -16.055481  
5       42.786150 -16.033536  
6       42.805789 -16.011356  
7       42.764140 -16.004886  
8       42.653112 -16.003369  
9       42.654932 -15.875932  
10      43.186350 -16.006199  
11      43.199697 -15.845644  
12      42.573980 -15.760018  
13      42.663563 -15.721228  
14      42.526411 -15.673636  
15      42.127916 -15.621262  
16      42.684529 -15.590152  
17      42.887847 -15.590066  
18      42.814039 -15.488683  
19      43.032261 -15.443258  
20      42.902952 -15.429693  
21      42.971542 -15.369771  
22      42.898652 -15.281855  
23      43.416253 -15.849093  
24      43.446421 -15.715171  
25      43.975631 -15.640883  
26      44.080175 -15.541588  
27      43.946408 -15.461170  
28      43.240515 -15.333975  
29      43.474482 -15.362416  
...           ...        ...  
34045  316.868830  15.262468  
34046  316.846337  15.292399  
34047  316.038920  15.352768  
34048  316.380781  15.400302  
34049  316.278103  15.485835  
34050  316.289092  15.567534  
34051  315.968378  15.579224  
34052  315.906442  15.603468  
34053  315.975706  15.702522  
34054  316.448964  15.539064  
34055  316.248519  15.656465  
34056  316.264372  15.744417  
34057  316.598712  15.829726  
34058  317.172047  15.258519  
34059  317.339125  15.432362  
34060  317.049577  15.447441  
34061  317.028640  15.586138  
34062  317.078328  15.607805  
34063  317.072613  15.662567  
34064  317.498492  15.521843  
34065  317.726766  15.758154  
34066  317.153895  15.862830  
34067  316.706814  15.800650  
34068  317.087587  15.951840  
34069  317.143296  15.970571  
34070  317.078236  16.006807  
34071  317.010546  16.129270  
34072  317.640833  15.924969  
34073  317.136362  16.193729  
34074  317.357581  16.135962  

[2057050 rows x 17 columns]

We see that we have 2,057,050 stars in our dataset, each of which has a measured position on the sky (both (ra,dec) and (l,b)), a parallax, a G band magnitude, and proper motion on the sky (pmra,pmdec). Each of these measurements also has an error associated with it.

We can see how much memory our DataFrame object is taking up with,

print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))
2057050 rows
296.2 MB

We can try reducing this by only loading in the columns we'll be working with.

dat  = pd.read_hdf('alldata.hdf',columns=['ra','ra_error','dec','dec_error','parallax','parallax_error'])
print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))
print(dat)
2057050 rows
115.2 MB
               ra  ra_error       dec  dec_error   parallax  parallax_error
0       45.034330  0.305989  0.235392   0.218802   6.352951        0.307910
1       45.165007  2.583882  0.200068   1.197789   3.900329        0.323488
2       45.086155  0.213836  0.248825   0.180326   3.155313        0.273484
3       45.066542  0.276039  0.248211   0.200958   2.292367        0.280972
4       45.136038  0.170697  0.335044   0.170130   1.582077        0.261539
5       45.141378  0.205008  0.359618   0.179848   8.663080        0.255867
6       45.152959  0.189524  0.386343   0.171880   5.698205        0.263677
7       45.112779  0.206981  0.380844   0.150943   2.090812        0.222206
8       45.010270  0.243576  0.351099   0.169345   6.202492        0.247253
9       44.974659  0.267911  0.473472   0.163531   1.677673        0.222067
10      45.501447  0.134010  0.497704   0.395606   9.957504        0.547771
11      45.467165  0.146028  0.655080   0.151794   5.032630        0.230189
12      44.866247  0.160620  0.561507   0.151276   2.458658        0.218785
13      44.937343  0.208963  0.623838   0.157091  15.728640        0.229521
14      44.797149  0.794706  0.630677   0.485048   4.695800        0.956438
15      44.414969  0.205710  0.567882   0.168425  11.253095        0.267342
16      44.918310  0.191469  0.755087   0.181781   2.785788        0.268922
17      45.105609  0.168169  0.812344   0.157115  10.549845        0.231620
18      45.008011  0.170795  0.888580   0.164587   4.198517        0.250279
19      45.196000  0.213285  0.993305   0.206712   6.581782        0.303335
20      45.072787  0.213813  0.969997   0.165124   5.746190        0.251024
21      45.118602  0.324832  1.046581   0.235454   1.186701        0.301492
22      45.025729  0.219474  1.110209   0.215515   2.081689        0.313718
23      45.667584  0.156331  0.712084   0.155345   5.400538        0.245903
24      45.656687  0.250327  0.848682   0.211829   1.512226        0.332788
25      46.123388  0.151871  1.066354   0.532063   2.986565        0.776729
26      46.191486  0.159092  1.190304   0.571938   0.656160        0.845735
27      46.044931  0.212695  1.230477   0.629442   2.872148        0.906019
28      45.356442  0.365251  1.156174   0.351513   4.740228        0.522721
29      45.580692  0.376271  1.194216   0.157121   0.846912        0.405900
...           ...       ...       ...        ...        ...             ...
34045  314.765232  0.363478 -1.194010   0.212978   3.050534        0.741126
34046  314.735773  0.440119 -1.171667   0.238175   1.496255        0.888100
34047  313.972540  0.156991 -1.338388   0.123209   2.869659        0.281346
34048  314.274696  0.251578 -1.198183   0.608068   2.031372        0.599195
34049  314.155273  0.348785 -1.144750   0.337395   8.599250        0.526418
34050  314.141918  0.209730 -1.063458   0.201801   3.199663        0.228593
34051  313.842549  0.206049 -1.140866   0.160198   3.089541        0.411665
34052  313.778438  0.450388 -1.134689   0.213632  -0.052200        0.944066
34053  313.814036  0.222999 -1.020702   0.139983   1.414788        0.459721
34054  314.297637  0.383452 -1.046371   0.173235   1.441584        0.777750
34055  314.078916  0.381021 -0.989514   0.224392   3.054000        0.724973
34056  314.068253  0.445682 -0.900887   0.437163   5.024796        0.675503
34057  314.351767  0.646789 -0.726464   0.404334   8.092877        0.477413
34058  315.046298  1.192695 -1.112642   0.770643   0.888197        0.932313
34059  315.149614  0.284295 -0.899310   0.149646   3.290275        0.563243
34060  314.878231  0.225291 -0.966348   0.160218   1.611850        0.458997
34061  314.818544  0.286806 -0.839533   0.257089   0.445328        0.427754
34062  314.858023  0.183427 -0.804869   0.179483   3.078234        0.335764
34063  314.836804  0.224744 -0.754084   0.204943   3.427077        0.399098
34064  315.270266  0.312305 -0.768764   0.190012   3.169772        0.302901
34065  315.410962  0.665954 -0.478297   0.389449  23.499923        0.516988
34066  314.853264  0.202654 -0.539717   0.176745   6.082151        0.276084
34067  314.459750  0.968369 -0.724208   0.589946  13.715043        0.691190
34068  314.766314  0.301253 -0.473161   0.167501   7.962663        0.588791
34069  314.812092  0.305573 -0.439632   0.295601   0.785996        0.446457
34070  314.741700  0.277332 -0.423197   0.249741   5.074307        0.284089
34071  314.643818  0.359097 -0.324975   0.148234   1.698348        0.741014
34072  315.282880  0.472265 -0.343177   0.209030   6.036938        0.396880
34073  314.740648  0.192053 -0.228114   0.131650   1.484142        0.348601
34074  314.960731  0.491821 -0.221301   0.268675   2.680111        0.450774

[2057050 rows x 6 columns]

We cut the size of our DataFrame in half!

Pandas indexing

One thing that makes pandas great for analyzing large datasets is the way it stores the data. Rather than using a normal list or numpy array, pandas uses a hashtable to quickly find a particular row of the data. We can see the keys for this hashtable as the row numbers when we print our pandas DataFrame object.

We can see all of these indices for our DataFrame by accessing the index attribute of the DataFrame,

print(dat.index)
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            34065, 34066, 34067, 34068, 34069, 34070, 34071, 34072, 34073,
            34074],
           dtype='int64', length=2057050)

So pandas uses a list of integers to label the rows of our data. We can get a little more information by sorting this list before printing it,

import numpy as np
print(np.sort(dat.index))
[     0      0      0 ..., 134864 134864 134864]

Hold on, something doesn't look quite right. We have duplicate indices for different rows of our data! This is one thing we have to careful about when we feed pandas a large dataset -- the hashtable may have non-unique keys.

Luckily, we can fix this by simply recalculating the indices after we load our data,

dat  = pd.read_hdf('alldata.hdf',columns=['ra','ra_error','dec','dec_error','parallax','parallax_error'])
dat = dat.reset_index()
dat.index
RangeIndex(start=0, stop=2057050, step=1)

It looks like we have unique indices now!

Now we can move on to visualizing our data.

Visualizing our data

The DataFrame object contains some built-in convienence functions for quickly getting a sense of your data. For example, we can quickly make histograms of different columns with the dat.hist() method,

import matplotlib.pyplot as plt
%matplotlib inline
fig,axes = plt.subplots(1,3,figsize=(17,3))
dat.hist(['ra','dec','parallax'],ax=axes,xlabelsize=15,ylabelsize=15,bins=50);

png

We can see right away that Gaia is an all sky survey as it covers the full range of right ascension and declination. The parallax histogram looks a little funny though. There seems to be some bad parallax data that we should remove before proceeding. Remember that parallax is related to distance via

$$\text{distance in pc} = 1 / (\text{parallax in arcsec}) .$$

The Gaia parallaxes are reported in milliarcsec, and so the distances will be in kpc. From looking at the histogram for parallax we see two problems. The first is there are a number of negative parallaxes. We can verify this by slicing our DataFrame object to only show the rows where the parallax is negative,

print(dat[dat.parallax<0])
         index          ra  ra_error        dec  dec_error  parallax  \
783        783   45.515296  0.141345   5.989946   0.346414 -0.012890   
810        810   44.974303  0.520221   6.200443   0.414850 -0.014806   
1031      1031   51.650489  0.814994   5.706888   0.612814 -1.407067   
1245      1245   51.340087  0.691294   8.748267   0.362725 -0.298734   
2443      2443   37.129398  0.831394  10.069239   0.399300 -0.133778   
3822      3822   46.991066  0.629890  16.730952   0.225115 -0.224989   
3897      3897   43.298678  0.480449  16.169881   0.374759 -0.111915   
4086      4086   56.877414  1.169436  11.461870   0.508449 -0.572870   
4191      4191   55.527342  0.449842  11.467865   0.161759 -0.261065   
4530      4530   56.914127  0.668079  14.841630   0.171766 -2.472477   
4988      4988   55.946968  0.942106  16.561471   0.226408 -0.291551   
5100      5100   54.796546  0.780839  16.734744   0.398559 -0.532618   
5270      5270   55.769354  0.439302  19.020534   0.217892 -0.325832   
5632      5632   65.286445  1.140493  19.601063   0.749313 -0.255108   
7086      7086   54.880870  0.869912  19.685986   0.315988 -0.152013   
7334      7334   48.250066  0.791378  18.115850   0.307591 -0.159963   
7482      7482   49.470113  0.764500  19.761699   0.258826 -0.052431   
7829      7829   52.194308  1.188875  22.436005   0.152018 -1.705954   
8084      8084   55.344026  3.121221  20.988876   1.416083 -0.643785   
8443      8443   56.009051  3.420429  23.599998   1.569564 -0.481137   
8538      8538   59.577674  1.184833  23.312322   0.180246 -0.564561   
9387      9387   51.886178  0.515588  25.332011   0.479561 -0.039135   
9542      9542   56.062806  0.776669  25.073097   0.163101 -0.332213   
9880      9880   55.019239  0.725582  26.315274   0.122226 -1.021263   
9934      9934   55.912590  0.852220  27.114028   0.157410 -0.190852   
10041    10041   56.636902  0.296163  27.668668   0.293690 -0.349588   
10115    10115   55.745582  0.312157  28.210019   0.204863 -0.262956   
10179    10179   56.106752  0.909904  29.496826   0.148312 -1.378886   
10377    10377   32.710331  0.525565  12.526121   0.394918 -0.804147   
11359    11359   33.459140  2.648608  18.039652   0.832688 -0.381682   
...        ...         ...       ...        ...        ...       ...   
2056423  33448  311.248518  0.398961  -6.117082   0.187211 -0.672482   
2056433  33458  311.598598  0.439350  -5.616935   0.208287 -0.155961   
2056452  33477  312.064301  0.242140  -5.430808   0.149086 -0.269706   
2056464  33489  313.668019  0.412905  -5.583705   0.188313 -0.048324   
2056484  33509  314.589281  0.425653  -4.863244   0.176742 -0.055059   
2056499  33524  313.336793  0.434973  -4.755779   0.189681 -0.347159   
2056521  33546  313.344539  0.350363  -3.974262   0.177033 -0.573623   
2056535  33560  311.056699  0.381851  -5.659757   0.191707 -0.240922   
2056548  33573  311.424679  0.418652  -5.259474   0.177835 -0.789527   
2056579  33604  309.973201  0.497932  -5.019053   0.262649 -2.044536   
2056595  33620  309.810772  0.372961  -4.650966   0.161991 -1.649397   
2056599  33624  310.035346  0.435446  -4.345697   0.183701 -0.327885   
2056650  33675  313.086805  0.338832  -3.911582   0.242674 -0.019779   
2056681  33706  311.913629  0.373363  -3.557187   0.192121 -0.255225   
2056691  33716  311.001961  0.468283  -3.685968   0.405888 -0.032160   
2056701  33726  311.239858  0.529183  -3.298700   0.413599 -0.582164   
2056705  33730  311.390080  0.384640  -3.148623   0.329609 -0.928917   
2056714  33739  312.391935  0.356388  -3.165960   0.235713 -0.227645   
2056722  33747  312.564209  0.378476  -3.115382   0.301502 -0.184599   
2056739  33764  312.421300  0.229743  -2.695243   0.178499 -0.001552   
2056760  33785  315.346678  0.508684  -3.933339   0.265979 -0.316290   
2056775  33800  314.851546  0.473330  -3.787378   0.338367 -1.053046   
2056779  33804  315.720378  0.451577  -3.918519   0.269825 -0.123523   
2056796  33821  315.389806  0.385371  -3.517012   0.194703 -1.540204   
2056831  33856  314.920480  0.359593  -3.260620   0.147800 -1.791382   
2056883  33908  315.817813  0.649911  -2.574199   0.308736 -0.148027   
2056952  33977  313.021008  0.472255  -2.582160   0.219522 -0.510647   
2056963  33988  312.370100  0.474772  -2.499009   0.257280 -1.169252   
2057002  34027  314.913688  0.475829  -1.321923   0.203663 -1.614503   
2057027  34052  313.778438  0.450388  -1.134689   0.213632 -0.052200   

         parallax_error  
783            0.538805  
810            0.586321  
1031           0.948178  
1245           0.878693  
2443           0.893831  
3822           0.503557  
3897           0.612425  
4086           0.875119  
4191           0.412225  
4530           0.537089  
4988           0.805282  
5100           0.878620  
5270           0.376044  
5632           0.894750  
7086           0.726433  
7334           0.891198  
7482           0.854152  
7829           0.886046  
8084           0.888988  
8443           0.540520  
8538           0.948216  
9387           0.851018  
9542           0.571965  
9880           0.622645  
9934           0.785840  
10041          0.901389  
10115          0.519180  
10179          0.739238  
10377          0.661857  
11359          0.573779  
...                 ...  
2056423        0.810346  
2056433        0.938573  
2056452        0.465799  
2056464        0.873866  
2056484        0.694494  
2056499        0.921673  
2056521        0.726616  
2056535        0.817068  
2056548        0.884565  
2056579        0.999294  
2056595        0.801551  
2056599        0.933150  
2056650        0.603673  
2056681        0.529171  
2056691        0.891996  
2056701        0.835858  
2056705        0.634763  
2056714        0.704435  
2056722        0.718506  
2056739        0.263805  
2056760        0.970300  
2056775        0.870059  
2056779        0.915759  
2056796        0.790938  
2056831        0.760018  
2056883        0.994737  
2056952        0.960010  
2056963        0.983299  
2057002        0.986499  
2057027        0.944066  

[30840 rows x 7 columns]

We can remove these by using the dat.drop() method,

dat = dat.drop(dat[dat.parallax < 0].index)
print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))
fig,axes = plt.subplots(1,3,figsize=(17,3))
dat.hist(['ra','dec','parallax'],ax=axes,xlabelsize=15,ylabelsize=15,bins=50);
2026210 rows
129.7 MB

png

Looks better.

Note that if we forgot to reset our DataFrame indices with dat.reset_index() earlier, then calling the dat.drop() method would have removed all of the duplicate rows that shared an index with one of the negative parallax entries!

Next, we have a few parallaxes which are relatively large and are skewing our histogram. We can verify this by sorting our rows by descending parallax,

dat.sort_values('parallax',ascending=False)
index ra ra_error dec dec_error parallax parallax_error
1035691 91636 176.937519 0.462309 0.799460 0.281370 295.803638 0.535664
636544 97084 280.683855 0.920717 59.634628 0.292361 286.391028 0.832375
636545 97085 280.683792 0.389218 59.637860 0.301627 283.536922 0.407929
94182 94182 4.612109 0.287586 44.024673 0.367812 280.740075 0.305500
1976175 88065 319.296181 0.368213 -38.872138 0.370400 251.138192 0.568683
1157706 78786 247.574841 0.420255 -12.667522 0.251275 232.285621 0.487817
377161 107431 264.104293 0.445606 68.333850 0.405466 220.024758 0.393103
1431397 82747 176.454907 0.178622 -64.842957 0.222308 215.782333 0.272784
733609 59284 343.323837 0.881040 -14.266502 0.789781 214.051333 0.970971
1972089 83979 323.391271 0.234962 -49.012404 0.297190 201.118291 0.511675
288435 18705 176.938348 0.395537 78.693164 0.325769 191.392475 0.388188
139634 4769 67.808490 0.633925 58.968549 0.474917 181.504752 0.622370
240557 105692 103.700371 0.582028 33.266518 0.694975 180.318422 0.836667
920809 111619 82.867329 0.616180 -3.685946 0.467683 177.082573 0.984590
686004 11679 293.097120 0.693566 69.653934 0.976061 173.600696 0.910296
802636 128311 92.643620 0.094764 -21.867624 0.203850 172.661342 0.291133
1139027 60107 289.227813 0.254156 5.163346 0.253736 169.432737 0.363420
747797 73472 357.306328 0.353555 2.397187 0.147146 169.092433 0.382738
1823379 70134 233.047362 0.395686 -41.279888 0.211157 168.759088 0.561607
1996551 108441 303.477084 0.299771 -45.164686 0.363286 162.235088 0.507104
1918530 30420 218.568533 0.555731 -12.517089 0.682684 159.132195 0.935274
264258 129393 138.592033 0.252283 52.683613 0.333815 158.956745 0.392579
1918218 30108 229.856644 0.306613 -7.722680 0.246396 158.642895 0.354875
344539 74809 246.355682 0.600227 54.303386 0.538926 154.083141 0.596926
1158736 79816 253.851650 0.265568 -8.326327 0.186491 153.695396 0.350193
1343870 130085 45.462532 0.952523 -16.594458 0.667242 148.510255 0.941835
759622 85297 344.140522 0.425399 16.552248 0.316028 145.981548 0.513801
1042468 98413 162.713204 0.792804 6.804717 0.593135 143.553036 0.971027
657982 118522 313.332456 0.251752 62.151173 0.386273 142.165443 0.310414
1163473 84553 261.436049 0.181358 2.106487 0.184051 129.264809 0.264233
... ... ... ... ... ... ... ...
1378204 29554 159.261422 0.195170 -63.488568 0.144476 0.000344 0.387339
2052194 29219 315.726278 0.411026 -16.420915 0.206001 0.000337 0.946564
1911692 23582 220.897611 0.701477 -21.006408 0.243206 0.000327 0.620934
1846402 93157 195.215629 0.230604 -57.582006 0.153829 0.000307 0.426777
718206 43881 29.042201 0.317048 -11.264366 0.220743 0.000288 0.423097
1795871 42626 263.781774 0.763716 -48.763230 0.575749 0.000280 0.341724
1851295 98050 198.723578 0.426686 -54.935028 0.640434 0.000280 0.717644
946606 2551 91.283731 1.018862 9.754557 1.125462 0.000256 0.862931
1371232 22582 151.228626 0.534664 -71.036046 0.440814 0.000256 0.671848
525558 120963 356.100629 0.500013 57.333236 0.328623 0.000246 0.727521
553476 14016 296.077076 0.220440 29.927156 0.258307 0.000236 0.368406
1611666 128151 112.045996 0.200702 -20.038448 0.377668 0.000236 0.377813
1647699 29319 119.261052 0.161939 -23.057956 0.244735 0.000233 0.451276
660269 120809 335.510747 0.218039 58.719272 0.137337 0.000201 0.263607
1192785 113865 269.628558 0.218763 13.198407 0.241046 0.000200 0.251184
1957352 69242 316.758525 0.147774 -49.610813 0.168307 0.000180 0.293074
1167595 88675 254.174149 0.366363 4.440060 0.413614 0.000177 0.340474
1141256 62336 299.819557 0.319539 6.330394 0.274028 0.000174 0.575161
1561195 77680 123.655289 0.342083 -35.202962 0.573828 0.000127 0.605542
901728 92538 117.037781 0.274975 10.822513 0.101694 0.000126 0.396744
855942 46752 111.823585 0.365425 -5.714403 0.246726 0.000100 0.602974
155879 21014 25.187795 0.155206 60.032685 0.168362 0.000093 0.295293
520903 116308 355.046848 0.330697 52.713712 0.264468 0.000080 0.569780
81991 81991 28.898533 1.133123 43.909136 0.180735 0.000055 0.521164
1893148 5038 201.621386 0.294549 -27.042397 0.118924 0.000045 0.270806
516735 112140 339.538380 0.605128 50.106226 0.498631 0.000041 0.661016
1012417 68362 163.932838 0.487281 -14.355836 0.239042 0.000037 0.817324
868189 58999 117.015646 0.507059 1.965003 0.308444 0.000036 0.638530
75437 75437 40.292760 0.337872 37.188368 0.179784 0.000031 0.284505
1731981 113601 183.534339 0.297646 -66.279400 0.144527 0.000002 0.442829

2026210 rows × 7 columns

A parallax of ~300 mas corresponds to a distance of about 3 pc. So these stars are very close to the Sun. We'll be primarily interested in stars that are >40 pc away, so let's drop any stars closer than 20 pc.

dat = dat.drop(dat[1./dat.parallax < 0.02].index)
print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))
fig,axes = plt.subplots(1,3,figsize=(17,3))
dat.hist(['ra','dec','parallax'],ax=axes,xlabelsize=15,ylabelsize=15,bins=50);
2025681 rows
129.6 MB

png

We could have also made these two cuts to our data at the same time

dat  = pd.read_hdf('alldata.hdf',columns=['ra','ra_error','dec','dec_error','parallax','parallax_error']).reset_index()
bad_indices = dat.parallax < 0   # Negative parallax indices
bad_indices = bad_indices | (1./dat.parallax < .02)  # OR operation with distances less than 30 pc
dat = dat.drop(dat[bad_indices].index)


print('{:d} rows'.format(len(dat)))
print('{:.1f} MB'.format(dat.memory_usage(index=True,deep=True).sum()/1e6))
fig,axes = plt.subplots(1,3,figsize=(17,3))
dat.hist(['ra','dec','parallax'],ax=axes,xlabelsize=15,ylabelsize=15,bins=50);
2025681 rows
129.6 MB

png

Finding some known clusters

Now that we have our data, let's find some clusters! Our procedure will be,

  • Input the (ra,dec,distance) position of a known star cluster.
  • Find the Gaia data within a few degrees of that.
  • Use a clustering algorithm to find the cluster (and any other clusters) in that field.

To save some time, I've coded up a Python class to hold all of the useful information and methods for a known star cluster. You should take some time to get a sense for how the class works. The class does several things:

  • Given the (ra,dec,distance) position of the cluster, it uses Astropy to convert this into a Galactocentric position, which we'll use to compare stellar positions in real space, as opposed to just on the sky.

  • The method .lims(), will calculate which of the given stars lies within a certain distance of the cluster center. You can see that it uses similar slicing operations that we saw earlier to return a subset of the data.

  • The method .plot_field() will make a scatter plot of the local field on the sky.

from astropy import units
import astropy.coordinates as coord
import numpy as np
class Cluster():
    def __init__(self,name,ra,dec,distance,size):
        """Initialize a cluster with the (ra,dec,distance) center and radial size."""
        self.name = name
        self.distance = distance
        self.coord = coord.SkyCoord(ra=[ra],dec=[dec],distance=[distance]*units.kpc)
        self.pos = (self.coord.ra[0]/units.deg,self.coord.dec[0]/units.deg)
        self.posg = self.coord.transform_to(coord.Galactocentric)
        self.posg = (self.posg.x[0]/units.kpc, self.posg.y[0]/units.kpc,self.posg.z[0]/units.kpc)
        self.size = size
        self.sizeg = np.deg2rad(self.size)*self.coord.distance[0]/units.kpc

    def lims(self,dat,nrad=5,sphere=True):
        if sphere:

            ind = (abs(dat.ra-self.pos[0])<self.size*nrad)&(abs(dat.dec - self.pos[1])<self.size*nrad)
            ind_dist = abs(1./dat.parallax * (1 + dat.parallax_error/dat.parallax) - self.distance) < .5*self.distance
            ind_dist |= ( abs(1./dat.parallax * (1 - dat.parallax_error/dat.parallax) - self.distance) < .5*self.distance)
            ind &= ind_dist
            print('{:d} objects in {:d} degree radius'.format(len(dat.ra[ind]),nrad))
        else:
            ind = (abs(dat.x - self.posg[0])<self.sizeg*nrad)&(abs(dat.y - self.posg[1])<self.sizeg*nrad)&(abs(dat.z - self.posg[2])<self.sizeg*nrad)
            print('{:d} objects in projected {:.1f} pc radius'.format(len(dat.ra[ind]),self.sizeg*nrad))
        return ind
    def plot_field(self,dat,nrad=5,sphere=True):
        ind = self.lims(dat,nrad,sphere)
        fig = plt.figure(figsize=(8,6))
        ax = fig.add_subplot(111)
        ax.tick_params(labelsize=15)
        img=ax.scatter(dat.ra[ind],dat.dec[ind],c=1e3/dat.parallax[ind],s=6,cmap='jet')
        ax.set_title('{}  D={:d} pc'.format(self.name,int(self.distance*1e3)),fontsize=20)
        ax.plot(self.pos[0],self.pos[1],'k*',ms=20)
        x = np.linspace(0,2*np.pi,100)
        ax.plot(self.pos[0] + self.size*np.cos(x),self.pos[1] + self.size* np.sin(x),'--k')
        cbar = plt.colorbar(img)
        cbar.ax.set_ylabel('Distance [pc]',fontsize=20)
        cbar.ax.tick_params(labelsize=15)
        ax.set_xlabel('Right Ascension [degrees]',fontsize=20)
        ax.set_ylabel('Declination [degrees]',fontsize=20)

Let's take the Hyades cluster is our first example. The Hyades is roughly located at (ra,dec,distance) = (4h27m,+15d52m,47 pc), and is ~330 arcmin in size.

Load in the Hyades and show the star field around its position on the sky.

hyades=Cluster('Hyades','4h27m','+15d52m',.047,330./60.)
hyades.plot_field(dat)
1277 objects in 5 degree radius

png

Let's also load in the Pleiades, Beehive, Coma, Alpha Persius clusters,

pleiades = Cluster('Pleiades','3h47m','+24d07m',.1362,110./60)
pleiades.plot_field(dat)
beehive = Cluster('Beehive','08h40.4m','19d59m',.177,95./60)
beehive.plot_field(dat)
coma = Cluster('Coma','12h22.5m','+25d51m',.086,7.5)
coma.plot_field(dat)
alphaper = Cluster('Alpha Persius','03h26.9m','+49d07m', .5*(.171+.200),185./60.)
alphaper.plot_field(dat)
1946 objects in 5 degree radius
2591 objects in 5 degree radius
8184 objects in 5 degree radius
8484 objects in 5 degree radius

png

png

png

png

The Algorithm

Now that we have our star fields, let's figure out how we're going deterime which stars are a part of a cluster. The basic idea will be to identify regions which are more dense (in phase space) than their surroundings. There are many different algorithms for doing this, and each has their advantages and weaknesses.

Grouping similar things together is a common task in machine learning, and we'll borrow the tools and techniques from that field for our analysis. In particular, we'll be using the popular DBSCAN algorithm.

DBSCAN works as follows. For every point in your dataset DBSCAN will search in a radius of $\epsilon$ for $N$ points. If there are at least $N$ points within this distance, then these points start a cluster. DBSCAN then goes through all of the other data points and looks within a ball of radius $\epsilon$ for $N$ more points. If any of these regions overlap, the clusters are merged. DBSCAN continues to do this until all points are either a part of a cluster, or are a part of the background noise. Here's a simple schematic of how this looks for $N=4$,

Luckily for us, the Python library Scikit-learn has implemented DBSCAN in its clustering module. You can read the documentation for how to use DBSCAN here.

I've written some wrapper functions (modified from the DBSCAN example) to find all of the clusters with min_samples in a radius of eps, and to plot the results in several differnt ways.

from astropy.coordinates import Distance
from sklearn.neighbors import DistanceMetric
from sklearn.cluster import DBSCAN
def get_clusters(pos,eps=.005,min_samples=20,sphere=True):   
    if sphere:
        print('Requiring that there are {:d} neighbors in a {:.1f} degree radius '.format(min_samples, np.rad2deg(eps)))
        print('Using haversine 2D distance on sky')
        dist= DistanceMetric.get_metric('haversine').pairwise(pos[:,:2])
    else:
        print('Requiring that there are {:d} neighbors in a {:.1f} pc radius '.format(min_samples, eps*1e3))
        print('Using Euclidean 3D distance')
        dist= DistanceMetric.get_metric('euclidean').pairwise(pos)
    db = DBSCAN(eps=eps, min_samples=min_samples,metric='precomputed').fit(dist)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    return labels,core_samples_mask
def cluster_distances(pos,labels,core_samples_mask,known_clusters=None,sphere=True,top='all',ax=None,fig=None):
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    unique_labels = np.unique(labels)
    if ax is None:
        fig,ax = plt.subplots(figsize=(8,6))
    if not sphere:
        print('Please use the ra/dec coordinates')
        return
    counts = np.array([len(labels[labels == k]) for k in unique_labels])
    cinds = np.argsort(counts)[::-1]    
    unique_labels = unique_labels[cinds]
    if top == 'all':
        top = len(unique_labels+1)
    else:
        top += 1
    colors = plt.cm.Vega10(np.linspace(0, 1, len(unique_labels[:top])))
    for k, col in zip(unique_labels[:top], colors):
        if k != -1:
            class_member_mask = (labels == k)
            d = pos[class_member_mask &core_samples_mask,-1]
            nc = len(d)
            if nc>1:
                lbl='{:d} Members'.format(nc)
            else:
                lbl = '1 Member'
            ax.hist(d*1e3,color=col,histtype='step',lw=3,label=lbl )
            ax.axvline(d.mean()*1e3,c=col,ls='--',lw=3)
    ax.set_xlabel('D [pc]',fontsize=20)
    ax.set_ylabel('Counts',fontsize=20)
    ax.tick_params(labelsize=15)
    ax.legend(loc='best',fontsize=15)
    if known_clusters is not None:
        for cluster in known_clusters:
            ax.axvline(cluster.distance*1e3,c='k',ls='--')
def plot_clusters(pos,labels, core_samples_mask,known_clusters=None,sphere=True,top='all',ax=None,fig=None):
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    unique_labels = np.unique(labels)
    if ax is None:
        fig,ax = plt.subplots(figsize=(8,8))

    counts = np.array([len(labels[labels == k]) for k in unique_labels])
    cinds = np.argsort(counts)[::-1]    
    unique_labels = unique_labels[cinds]
    if top == 'all':
        top = len(unique_labels+1)
    else:
        top += 1
    colors = plt.cm.Vega10(np.linspace(0, 1, len(unique_labels[:top])))
    for k, col in zip(unique_labels[:top], colors):
        alpha = 1
        if k == -1:
            # Black used for noise.
            col = 'k'
            alpha = .2

        class_member_mask = (labels == k)

        xy = pos[class_member_mask & core_samples_mask,:2]

        if sphere:
            xy = np.rad2deg(xy)
        ax.scatter(xy[:, 0], xy[:, 1], c=col, s=20,alpha=alpha)

        xy = pos[class_member_mask & ~core_samples_mask,:2]

        if sphere:
            xy = np.rad2deg(xy)
        ax.scatter(xy[:, 0], xy[:, 1], c=col, s=2,alpha=alpha)
        if k != -1:
            ax.plot([xy[:,0].mean()],[xy[:,1].mean()],marker='+',c=col,ms=25,mew=5)
    ax.set_title('Estimated number of clusters: {:d}'.format(n_clusters_),fontsize=20)

    if known_clusters is not None:
        for cluster in known_clusters:
            x = np.linspace(0,2*np.pi,100)
            if sphere:
                ax.plot(cluster.pos[1] + cluster.size*np.cos(x) ,cluster.pos[0] + cluster.size*np.sin(x) ,'--k',lw=3)
                ax.plot([cluster.pos[1]],[cluster.pos[0]],'*',ms=12,c='k')
                ax.set_ylabel('Right Ascension [Degrees]',fontsize=20)
                ax.set_xlabel('Declination [Degrees]',fontsize=20)

            else:
                ax.plot(cluster.posg[0] + cluster.sizeg*np.cos(x) ,cluster.posg[1] + cluster.sizeg*np.sin(x) ,'--k',lw=3)
                ax.plot([cluster.posg[0]],[cluster.posg[1]],'*',ms=12,c='k')
                ax.set_xlim(cluster.posg[0]-cluster.sizeg*5,cluster.posg[0]+cluster.sizeg*5)
                ax.set_ylim(cluster.posg[1]-cluster.sizeg*5,cluster.posg[1]+cluster.sizeg*5)
                ax.set_ylabel('X [kpc]',fontsize=20)
                ax.set_xlabel('Y [kpc]',fontsize=20)
    ax.minorticks_on()

    ax.tick_params(labelsize=15)

    fig.tight_layout()
def plot_clusters3d(pos,labels, core_samples_mask,known_clusters=None,sphere=True,top='all',ax=None,fig=None):
    from mpl_toolkits.mplot3d import Axes3D
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    unique_labels = np.unique(labels)
    if ax is None:
        fig = plt.figure(figsize=(8,8))
        ax = fig.add_subplot(111,projection='3d')
    counts = np.array([len(labels[labels == k]) for k in unique_labels])
    cinds = np.argsort(counts)[::-1]    
    unique_labels = unique_labels[cinds]
    if top == 'all':
        top = len(unique_labels+1)
    else:
        top += 1
    colors = plt.cm.Vega10(np.linspace(0, 1, len(unique_labels[:top])))
    for k, col in zip(unique_labels[:top], colors):
        alpha = 1
        if k != -1:
            class_member_mask = (labels == k)

            xyz = pos[class_member_mask & core_samples_mask,:]

            if sphere:
                xyz[:,:2] = np.rad2deg(xyz[:,:2])
            ax.scatter(xyz[:, 0], xyz[:, 1],xyz[:,2], c=col, s=50,alpha=alpha)

            xyz = pos[class_member_mask & ~core_samples_mask,:]

            if sphere:
                xyz[:,:2] = np.rad2deg(xyz[:,:2])
            ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:,2],c=col, s=2,alpha=alpha)
            if k != -1:
                ax.plot([xyz[:,0].mean()],[xyz[:,1].mean()],[xyz[:,2].mean()],marker='+',ms=20,mew=8,c=col)
    ax.set_title('Estimated number of clusters: {:d}'.format(n_clusters_),fontsize=20)
    ax.set_xlabel('X [kpc]',fontsize=20,labelpad=20)
    ax.set_ylabel('Y [kpc]',fontsize=20,labelpad=20)
    ax.set_zlabel('Z [kpc]',fontsize=20,labelpad=20)

    ax.minorticks_on()

    ax.tick_params(labelsize=15)

    fig.tight_layout()    
def prepare_coords(dat,cluster,sphere=True,nrad=5):
    ind = cluster.lims(dat,nrad=nrad)
    pos = coord.SkyCoord(ra=np.array(dat.ra[ind])*units.degree,dec=np.array(dat.dec[ind])*units.degree,distance=np.array(1./dat.parallax[ind])*units.kpc,frame='icrs')
    if sphere:
        return np.vstack((np.deg2rad(np.array(pos.dec/units.deg)), np.deg2rad(np.array(pos.ra/units.deg)),np.array(pos.distance/units.kpc))).T
    else:
        pos = pos.transform_to(coord.Galactocentric)
        return np.vstack((np.array(pos.x/units.kpc), np.array(pos.y/units.kpc),np.array(pos.z/units.kpc))).T
def find_cluster(cluster,dat,eps=.005,min_samples=5,nrad=5,top=3):
    pos = prepare_coords(dat,cluster,nrad=nrad,sphere=False)
    labels, core_mask = get_clusters(pos,
                                     eps=eps,min_samples=min_samples,sphere=False)
    fig,axes = plt.subplots(1,2,figsize=(12,6))
    pos = prepare_coords(dat,cluster,nrad=nrad)
    plot_clusters(pos,labels,
                  core_mask,known_clusters=[cluster],sphere=True,ax=axes[0],fig=fig,top=top)
    cluster_distances(pos,labels,
                      core_mask,known_clusters=[cluster],sphere=True,ax=axes[1],fig=fig,top=top)
    fig.tight_layout()
def find_cluster3d(cluster,dat,eps=.005,min_samples=5,nrad=5,top=3):
    pos = prepare_coords(dat,cluster,nrad=nrad,sphere=False)
    labels, core_mask = get_clusters(pos,
                                     eps=eps,min_samples=min_samples,sphere=False)

    plot_clusters3d(pos,labels,
                  core_mask,known_clusters=[cluster],sphere=False,top=top)

To use these you just need to call the find_cluster(cluster,dat,eps,min_samples) function, and specify eps and min_samples. The default values for these are 5 stars within 5 pc, but you should adjust them to see how they affect the final results.

To start, let's look at the Hyades.

%matplotlib inline
find_cluster(hyades,dat,eps=.003) # 5 stars within 3 pc
1277 objects in 5 degree radius
Requiring that there are 5 neighbors in a 3.0 pc radius
Using Euclidean 3D distance
1277 objects in 5 degree radius

png

DBSCAN found 9 clusters, and we're showing just the three largest ones. Each cluster has an associated color and is shown on the sky in the left plot. The center of mass is shown with a cross. The right plot shows the distribution of distances for each cluster, with the average distance shown by the dashed vertical line. Stars that are not a part of a cluster are shown in grey.

We can see that the largest cluster has 66 members and is roughly centered on the sky near the known location of the Hyades. Looking at the distances, we see that the distrubtion is centered close to the Hyades distance, and has a ~5 pc spread. The tidal radius of the Hyades is roughly ~10 pc, so these stars may actually be correctly classified. Note that the other clusers have < 5 members and are probably not signficant.

We can also use matplotlib's 3D plotting cababilities to see the cluster in 3D. For this we'll use the %matplotlib notebook magic to be able to manipulate the image.

%matplotlib notebook
# If the notebook interace doesn't show up, you made need to restart the kernel and run this cell
find_cluster3d(hyades,dat,eps=.003)
1277 objects in 5 degree radius
Requiring that there are 5 neighbors in a 3.0 pc radius
Using Euclidean 3D distance

png

Next we'll do the Pleiades,

%matplotlib inline
find_cluster(pleiades,dat,eps=.0035)
1946 objects in 5 degree radius
Requiring that there are 5 neighbors in a 3.5 pc radius
Using Euclidean 3D distance
1946 objects in 5 degree radius

png

%matplotlib notebook
find_cluster3d(pleiades,dat,eps=.0035)
1946 objects in 5 degree radius
Requiring that there are 5 neighbors in a 3.5 pc radius
Using Euclidean 3D distance

png

More clusters

Try different values of eps and min_samples for the Hyades, Pleiades, and the other clusters below, to see how accurate you can get. Here are some questions to think about:

  • Is there a sweet spot that works well for all of the clusters?
  • Or do you have to finely tune the parameters each time?
  • Is there a quantitative way to find the best eps and min_samples?
  • How would you incorporate the positional uncertainties for each star in the clustering analysis?
  • Can you search through the entire dataset to find new clusters?
alphaper = Cluster('Alpha Persius','03h26.9m','+49d07m', .5*(.171+.200),185./60.)
coma = Cluster('Coma','12h22.5m','+25d51m',.086,7.5)
IC2391 = Cluster('IC 2391','08h40.6m','−53d02m',.175,50./60)
IC2602 = Cluster('IC 2602','10h42m57.5s','−64d23m39s',.167,50./60)

Releases

No releases published

Packages

No packages published